Finding the number of common users between two websites


There are two Swiss (.ch) websites, let's call them A and B. A is owned by me and B by a customer.
Because of legal data protection issues B is hosted in Switzerland and not allowed to store any user information abroad. Which means that software like Google Analytics is not available on B. A is a Swiss website but hosted in a (European) cloud.
Now we would like to find out how many common users we both have over the duration of 30 days. In short:
|usersA ∩ usersB|
For the sake of simplicity: Instead of users we are perfectly happy to measure common browsers.
What would you suggest is the simplest way to solve this problem?

First of all, best regards from Zurich/Zug :) Swiss people are everywhere...
I don't think you're correct that it's illegal to collect data in Switzerland at all (or to store it abroad). As I work in the financial industry, I know this topic very well, and we also had to do a lot of research before we could use GA at all.
The question is always what data you collect and how. What you can't do, unless you get the user's permission up front, is store personally identifiable information. GA doesn't allow that anyway: you can't import or save email addresses, for example, in custom dimensions/metrics.
Please check https://support.google.com/adsense/answer/6156630?hl=en for general background on this topic.
If you save IP addresses via IP anonymization, you shouldn't run into problems as long as you declare this in your data-privacy statement. Take this approach: https://support.google.com/analytics/answer/2763052?hl=en
I'm not a lawyer and don't want to give you legal advice, but ours told us that's fine. If you are really paranoid about sending data to the USA, like we have to be, you can exclude very sensitive forms from tracking.
To go back to your basic question, if you want to find this out via Google Analytics, your key is "cross domain tracking". Check https://support.google.com/analytics/answer/1034342?hl=en for more information in this direction.
The only workaround I can think of besides this is to start collecting browser fingerprints yourself and then join both collections on the fingerprints (that's not reliable, as your visitors will use more than one device/configuration). Personally, I would go for IP anonymization, exclude very sensitive forms, make sure your data-privacy declaration contains all the necessary parts, and offer an opt-out option; then you should be on the safe side.
All the best and TGIF :)
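To make the fingerprint workaround a bit more concrete, here is a minimal sketch of how the two collections could be joined. It assumes both sites hash the same few browser attributes with a shared secret key and only exchange the resulting hashes; the key, the attribute choice, and the function names are all made up for illustration, and whether this is acceptable under your privacy rules is a separate question.

```python
import hashlib
import hmac

# Hypothetical secret agreed between site A and site B (an assumption for this sketch).
SHARED_KEY = b"replace-with-a-long-random-secret"


def browser_token(user_agent: str, accept_language: str, screen: str) -> str:
    """Return a keyed SHA-256 hash of a few browser attributes (an illustrative fingerprint)."""
    raw = "|".join([user_agent, accept_language, screen]).encode("utf-8")
    return hmac.new(SHARED_KEY, raw, hashlib.sha256).hexdigest()


def common_browsers(tokens_a: set, tokens_b: set) -> int:
    """Count the tokens seen on both sites, i.e. the size of the intersection."""
    return len(tokens_a & tokens_b)


# Each site builds its own 30-day token set; only the hashes are compared.
site_a = {
    browser_token("Mozilla/5.0 (X11; Linux)", "de-CH", "1920x1080"),
    browser_token("Mozilla/5.0 (Macintosh)", "fr-CH", "1440x900"),
}
site_b = {
    browser_token("Mozilla/5.0 (X11; Linux)", "de-CH", "1920x1080"),
    browser_token("Mozilla/5.0 (Windows NT 10.0)", "it-CH", "1366x768"),
}
print(common_browsers(site_a, site_b))
```

Two different browsers with identical attributes would collapse into one token, and one person on two devices would count twice, so the result is an estimate of common browsers, not of common people.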

Related

Using Google Tag Manager to track personalised data

I am considering using Google Tag Manager to track abandoned forms. I note that collecting data that "personally identifies an individual (such as a name, email address or billing information)" is against their Terms of Service.
However, I am wondering whether Google actually enforces this policy, and if so, how?
P.S. I'm not actually looking to implement this - I'm merely curious whether and how it is actually enforced.

It's not enforced unless people write to Google. There are many tracking mistakes on various sites where PII gets tracked and no one does anything about it for years, even decades. Google doesn't actively check the content of your data to find PII, but it will investigate and take action if people send complaints.
Corp lawyers and tracking specialists are usually extremely careful around gathering PII data without explicit consent. Well, maybe gigantic corps would be an exception since they know exactly how to monetize PII.
Anyhow, the real issue here is that you rarely (if ever) want to know what they type for analytics reasons. You want to know which fields they interacted with, how many times they tried to submit the form, and what kind of errors they got before abandoning. That allows for plenty of analysis aimed at improving the form's performance.
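As a rough illustration of that last point, here is a sketch of a per-visit record that keeps field names, submit attempts, and error counts, but never the typed values. The class and method names are invented for this example and are not part of GTM or GA.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FormSession:
    """Interaction summary for one form visit; stores field names, never typed values."""
    touched_fields: List[str] = field(default_factory=list)
    submit_attempts: int = 0
    errors: Counter = field(default_factory=Counter)
    completed: bool = False

    def touch(self, field_name: str) -> None:
        """Record that the visitor interacted with a field (first touch only)."""
        if field_name not in self.touched_fields:
            self.touched_fields.append(field_name)

    def submit(self, error_fields: List[str]) -> None:
        """Record a submit attempt and which fields failed validation, if any."""
        self.submit_attempts += 1
        if error_fields:
            self.errors.update(error_fields)
        else:
            self.completed = True

    def last_field_before_abandon(self) -> Optional[str]:
        """Return the most recently first-touched field if the form was never completed."""
        if self.completed or not self.touched_fields:
            return None
        return self.touched_fields[-1]


session = FormSession()
session.touch("email")
session.touch("phone")
session.submit(error_fields=["phone"])
print(session.submit_attempts, dict(session.errors), session.last_field_before_abandon())
```

Sending these three pieces of information as events is usually enough to see where a form loses people, without any PII leaving the browser.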

Intent Data - How exactly are traceable urls used to track interest in b2b topics?

I've been doing some research on intent data and I have some technical questions, especially about how two businesses might be collecting "contact level" i.e. personally identified web traffic details without using third-party cookies.
Some quick background: Most of the large providers of intent data (Bombora, The Big Willow/Aberdeen/Spiceworks Ziff Davis, TechTarget, etc.) offer "account" based intent data - essentially, when users visit websites in their network, they do a reverse IP address lookup, match the addresses to known IP addresses of large companies (usually companies with at least 250 employees) and note which topics are "surging" - i.e. showing unusual traffic in a given week. This largely makes sense to me. I'm assuming that when a visitor shows up at your site, Google Analytics and similar tools can tell you which Google search keywords were used to arrive at your site, and that's how they can say things like - we can "observe intent signals across an unlimited number of contextual keyword categories, allowing you to customize your keywords and layer these insights onto your campaigns for optimal performance." Third-party cookies and data from DSPs (demand-side platforms that let ad buyers buy ads across many platforms) are also involved in providing data, though these will be less useful sources of data after Google sunsets third-party cookies in Chrome.
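For what it's worth, this is roughly how I picture the account-level matching. It is only a sketch under my own assumptions: the network table, the company names, and the "twice the weekly baseline" surge rule are invented, and real providers presumably use far larger datasets and different thresholds.

```python
import ipaddress
from collections import Counter

# Hypothetical table of company networks (documentation-only address ranges).
COMPANY_NETWORKS = {
    "Example Corp": ipaddress.ip_network("192.0.2.0/24"),
    "Sample AG": ipaddress.ip_network("198.51.100.0/24"),
}


def company_for_ip(ip: str):
    """Return the company whose known network contains this IP, or None."""
    addr = ipaddress.ip_address(ip)
    for name, network in COMPANY_NETWORKS.items():
        if addr in network:
            return name
    return None


def topic_counts(visits):
    """Count visits per (company, topic); visits is an iterable of (ip, topic) pairs."""
    counts = Counter()
    for ip, topic in visits:
        company = company_for_ip(ip)
        if company is not None:
            counts[(company, topic)] += 1
    return counts


def is_surging(this_week: int, baseline_weekly_avg: float, factor: float = 2.0) -> bool:
    """Assumed surge rule: this week's count is at least `factor` times the baseline."""
    return this_week > 0 and this_week >= factor * baseline_weekly_avg


visits = [("192.0.2.10", "crm"), ("192.0.2.77", "crm"), ("203.0.113.5", "crm")]
counts = topic_counts(visits)
print(counts[("Example Corp", "crm")], is_surging(counts[("Example Corp", "crm")], 0.5))
```

Note that this only ever yields a company, never a person, which is why the "contact level" claims below puzzle me.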
Two providers - intentdata.io, and intentflow.com are offering contact level intent data. You can imagine why that would be of interest - if the director of sales is interested in your sales SaaS tool, you have a better idea of how qualified that lead is and who to reach out to. Only one of the two providers is specific about what exactly they're collecting - i.e. what "intent" they are capturing and how they're collecting it.
Intentdata.io:
Intentdata.io looks like a tiny company (two employees on LinkedIn). The most specific statement I've found about what their data is was in an Impact+ podcast interview - Ed, the CRO at intentdata.io, mentions that the data is analogous to commenting on a Forbes article or a conversation on LinkedIn. But he's clear - "that's just an analogy." They also say elsewhere that the data they provide mentions specifically what action the contact took that landed them in the provided data.
Ed from intentdata.io is also asked about GDPR compliance in his Impact+ interview - he basically says, some lawyers will disagree but he believes their data to be GDPR compliant, and it is in use by some firms in the EU. He does mention though that some firms have asked them to exclude certain columns from the data, like email addresses.
Edit: Found a bit more on intentdata.io - looks like they build a custom setup to pull "intent" data for each customer - they don't have a database monitoring company interaction with content across social media and b2b sites, instead you provide them with "lists (names and URLs) of customers, competitors, influencers, events, target accounts and key terms that would indicate intent at different stages in the buying journey. Pull together important hashtags, details on your ideal buyer (job titles, functions, seniority) and firmographics (size, industry, location)" - then they create a custom "algorithm" from this info, and they iterate on that "algorithm" a little bit over time.
They also make this statement on their site: "IntentData.io's data is collected from observing public actions that users are taking around the web. That means that first, we observe action (not reading, searching, browsing, being shown an ad, etc.) which we believe is a more concrete manifestation of intent. Second, people are taking these actions publicly for the world to see. We do not use any cookies, bidstream data or reverse IP lookups."
Finally, one piece of their sales collateral asks: What ad budget do you have for PPC nurturing ads? So there may be some targeted PPC ads involved in the "algorithm."
Edit 2: Their sales collateral also states that they use "a third-party intent data methodology that uses multi-variable linear regression analysis to correlate observed actions with a specific contact. This is the method that the LeadSift engine of IntentData.io data uses."
Intentflow.com:
Intentflow.com seems like the sketchier of the two providers, if I'm honest. They provide a video walkthrough of how they get their data at intentflow.com/thesis - but I'm not following how using "traceable urls", with no cookies involved, could give you contact level information. They also say they look up the most popular articles/pages for 5k to 40k unique keywords or phrases that are related to the 10-50 keywords or phrases you give them to target. And they use "traceable urls" to track who visits those sites. Again - no cookies involved. Supposedly fully compliant, at least with US laws. They don't provide data for the EU "by design", so presumably they're not GDPR compliant? They also claim they can identify the individuals who are visiting your website, again using "traceable urls" - it seems clear from the pitch that you're asked to reach out to your backlink providers around the web to use this traceable url.
I've seen an interview where a rep from Bombora says they tried for a while to do contact level intent data and it wasn't very useful - and it wasn't really doable in a compliant way. Ed seems to be aware they've said that publicly, and he says "that's just not true."
So what's going on here? How exactly are these two small firms getting contact level intent data? Do you think they're doing it in a compliant way?

Got more information:
Intentdata.io uses public comments, likes, shares, etc. on blogs and social posts, gathered via web crawling and scraping, for the events, influencers, hashtags, articles, etc. that the customer deems worth tracking. They do some work to try to connect the commenters with an identifiable contact. They bill on a quarterly basis for this.
Intentflow.com doesn't seem to use "traceable urls" at all. They take bidstream data, and identify the individual visitors via an "identity graph." They provide a minimum of 5k contacts per month at $2 per contact, making their data very expensive ($120k+ per year). You can't get lower than however many contacts their system spits out per month so it seems like there's not a good firm limit on what you will be charged. They say they can identify ~70% of web traffic, and they only provide data on US site visitors. Each row of their output would include not just the contact, but the site that contact was shown an ad on. Definitely interesting data - but I'm guessing they will be very affected by upcoming changes to third party cookies, privacy laws, etc.
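The $120k+ figure is just the minimum volume multiplied out. A quick sketch of the arithmetic, assuming billing is a flat per-contact rate with the 5k monthly floor (the function name is mine, and the actual contract terms may differ):

```python
MIN_CONTACTS_PER_MONTH = 5_000
PRICE_PER_CONTACT_USD = 2


def annual_cost(contacts_per_month: int) -> int:
    """Yearly cost in USD, never billing fewer than the monthly minimum."""
    billed = max(contacts_per_month, MIN_CONTACTS_PER_MONTH)
    return billed * PRICE_PER_CONTACT_USD * 12


print(annual_cost(5_000))  # 120000
print(annual_cost(8_000))  # 192000
```

Since the monthly volume is whatever their system produces, the second line shows how quickly the bill grows past the floor.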

How to determine demographics of users visiting your site?

Ad servers seem to (and do) know a lot about the user who is visiting a certain webpage, leveraging behavioral and contextual targeting. I would love to be able to keep track of that data as well. In particular, I would like to know:
age range
male/female
geographical info
I would like this information on a per request basis (not a daily summary)
What is the best way to accomplish this?
Thanks!

There are vendors who specialize in characterizing your Site's traffic. Very roughly they work by finding the closest match to your Site from among a large population of Sites in which they do in fact have detailed demographic data. To improve the matching, some of them give you a javascript snippet to insert into your Site's pages to collect user data and send it to their servers (more or less like web analytics code).
Quantcast is one such vendor. The link I included will take you to their page that displays sample audience demographic reports.
Crowd Science is another.
Neither of these is free (though they might have a freemium service, I don't know).
Alexa, on the other hand, is free and offers similar data; just enter your Site's url in their textbox, then when you get the results page, select the Audience tab.

Age and Gender: Ask your users.
Geographical Info: Use GeoIP targeting.

You can try Hitwise, but it's a little on the pricey side IIRC

Doug's is a good answer, but Google Analytics now gives you this too, based on their acquisition of DoubleClick. So it's free.
Google Analytics Demographics & Interests
Note that no matter who you get this information from, the information is based on cross-site information. This is based on "third party cookies" which many users turn off (sometimes without knowing they are doing this) depending on their browser's security/privacy settings.

When is Google Analytics not good enough? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I'm trying to determine why an enterprise wouldn't want to use Google Analytics.
Here are the main reasons I've seen mentioned:
Inability to track clients that have Javascript disabled.
Lack of ownership of the statistics - Google owns the data.
Most of the web clients with Javascript disabled will probably be bots/spiders. This data is interesting, but probably not very useful.
As for the ownership issue, this is a bit paranoid IMO.
What am I missing here? When is Google Analytics not good enough?

Here are my findings from additional research:
Google Analytics is limited to 5 million page views per month - source
If a web site generates more than 5 million pageviews per month, it will need to be linked to an active AdWords account to avoid interruption of service.
Lack of / slow technical support
All Google support is handled through email and response times can take a week or more. Commercial analytics products often have much faster & personalized support.
Inability to track files (PDFs, images, etc.)
GA relies on Javascript and files lack the ability to execute Javascript. The workaround to this problem is to tag the link, but this won't track requests that go directly to the file.
Limited ability to customize
This is a selling point that I see pushed by commercial analytics tools (WebTrends). However it's never explained what customizations are denied by GA but allowed by WebTrends.
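On the file-tracking point above: requests that go directly to a file never run any JavaScript, but they do show up in the web server's access log. Here is a minimal sketch of counting them, assuming the log is in Common Log Format; the regular expression, the extension list, and the sample lines are my own illustration.

```python
import re
from collections import Counter

# Matches the request and status part of a Common Log Format line,
# e.g. "GET /docs/report.pdf HTTP/1.1" 200
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')
FILE_EXTENSIONS = (".pdf", ".png", ".jpg", ".zip")


def count_file_requests(log_lines):
    """Count successful GET requests per file path, ignoring query strings."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("status") != "200":
            continue
        path = match.group("path").split("?", 1)[0]
        if path.lower().endswith(FILE_EXTENSIONS):
            counts[path] += 1
    return counts


sample = [
    '203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] "GET /docs/report.pdf HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '198.51.100.4 - - [10/Oct/2023:14:02:10 +0000] "GET /docs/report.pdf?src=mail HTTP/1.1" 200 2326',
]
file_counts = count_file_requests(sample)
print(dict(file_counts))
```

This catches hotlinked and bookmarked files that link tagging misses, at the cost of also counting bots unless you filter them out.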

The Google Analytics EULA does not allow you to track individual users by identifying them. So if you wanted to add a custom variable for username to track how many times each user logs in, then you would be in a gray zone if not outright violating the EULA.
I use Google Analytics on about 10 sites right now and it's a great tool. In addition to all the analytics stats, you can tie it in with AdSense and it becomes a marketing/revenue tool and not just "wow look at all these cool user stats". If there was a way to track by user ID in certain circumstances (e.g. if user's agreed to it, or if they work for the company that owns the site) then I would have no issues.
Besides, it's free and all you have to do is add JavaScript to the files, so give it a try and see what you think after a few months.

One reason that was, surprisingly, not posted:
timing / speed of reaction
It takes at least 4 hours (up to 24) for GA to update your data.
This is ok for me personally in most of the cases, but when reacting fast is crucial (news sites, one-off events, etc.) you may want to employ some other solution (Mint comes to mind, but it's not the only one out there of course).

Thought I'd add my two pence worth to this thread, as this is a topic close to my heart and one I've debated with colleagues for years. We've used WebTrends in house for as long as I can remember, back to version 4 of the log analyzer (how different things were back then!). Since Google Analytics came along, we've started to come under increasing pressure from certain parts of our business to switch, as 'it does everything we need from an analytics tool'.
Well, true, in many senses it does, especially these days. But I championed the integration of our CRM and web analytics tools back in 2006, and as our business isn't e-commerce (the 'conversion' happens offline, sometimes months after the visitor acquisition) we need to integrate in this way to get a true picture of campaign effectiveness and a notion of ROI.
All of this means we need access to the raw data and need to be able to join visitor records on sessionID etc. Without this access we'd be screwed. I'd love it if we could roll without it, but the current requirements mean we can't, so this alone is a HUGE reason why Google Analytics is not good enough.
Over and out
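To illustrate what joining on sessionID buys you, here is a toy sketch: web sessions carry the campaign, CRM records carry the offline deal value, and the join attributes revenue to campaigns. The record layout and field names are invented for the example and have nothing to do with any particular product's export format.

```python
# Raw web analytics rows: which campaign brought in each session.
web_sessions = [
    {"session_id": "s1", "campaign": "newsletter"},
    {"session_id": "s2", "campaign": "ppc"},
    {"session_id": "s3", "campaign": "ppc"},
]

# CRM rows: offline conversions, recorded later against the same session ID.
crm_deals = [
    {"session_id": "s2", "deal_value": 12000},
    {"session_id": "s3", "deal_value": 0},
]


def revenue_by_campaign(sessions, deals):
    """Join deals to sessions on session_id and sum deal value per campaign."""
    campaign_of = {s["session_id"]: s["campaign"] for s in sessions}
    totals = {}
    for deal in deals:
        campaign = campaign_of.get(deal["session_id"])
        if campaign is None:
            continue  # deal with no matching web session
        totals[campaign] = totals.get(campaign, 0) + deal["deal_value"]
    return totals


print(revenue_by_campaign(web_sessions, crm_deals))
```

Without row-level data that includes a session key, this kind of join simply isn't possible, whatever the reporting UI looks like.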

For tracking desktop software or creating a whitelabel solution there are better solutions.
For white-label and integration-based analytics, I use MixPanel. For desktop software, I use Deskmetrics.

Google Analytics does not work well with mobile phones. While the iPhone and the Palm may be supported, many of the existing handsets do not support the javascript that Google uses.

If you're based in the UK, then theoretically you could be breaking the Data Protection Act by using Analytics.
If information about your users (like which web pages they're looking at) goes "outside the European Economic Area" and onto Google's servers in the US, then you're breaking the DPA.
Pretty obscure, but you did ask :)
Piwik avoids the problem because you host it on your own servers.

Lack of ownership of the statistics - Google owns the data.
... As for the ownership issue, this is a bit paranoid IMO.
One problem with it is that we can't even access the raw data. We had a use case this week where we wanted a visitor map for an executive presentation. We needed to get more flexible with how the visitor map is displayed (wanted to view the map in Google Earth plug-in). In GA, you can't. You take what they give you. You can see a map of how many visits came from each city, but you can't export a data file of cities and number of visits, to run the data through other tools. So, paranoia aside, there are significant limitations on what you can accomplish with GA.
However this is not a problem if you use Urchin, the self-hosted version of GA: you can export the data and do what you want with it. (And the exported data is richer than the web server log's, as it includes some analysis already.)
Since Piwik is open source, and pluggable, I imagine you could enhance the visitor map plug-in any way you wanted to. And export whatever data you want.
Whether this limitation affects you depends on your needs, obviously.
Update: I've now looked at the GA Data Export API, and it turns out that things you cannot do through the UI (as you can with Urchin), you can do with this API. It does look like you can export the visit data I was talking about, via a feed (although there are daily traffic caps on those requests). So sprinkle salt heavily on what I wrote above.

A couple more points that I've come across:
GA doesn't let you dig beyond full-day statistics; I would often like the ability to investigate whether a traffic dip the previous day was caused by the design update I did at 1pm or the soccer match on TV at 8pm.
GA doesn't offer a workaround for traffic spikes caused by DDoS attacks, Slashdotting etc. When I'm looking at a GA visitor graph of 2009, all I can see is the 2-million-pageview-spike on October 16th, pushing the entire rest of the year down flat against the horizontal axis of the graph. To get a meaningful graph, GA should offer the ability to trim or exclude outlying data points, or the ability to limit/bracket the graph window itself
GA doesn't have an event monitoring client (think Reinvigorate's Snoop tool)
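For the traffic-spike point above, this is the kind of trimming I mean, applied to exported daily numbers. It is a sketch of one possible rule (cap every day at a multiple of the median); the multiple of 10 is an arbitrary assumption, and this is not something GA itself offers.

```python
import statistics


def clip_outliers(daily_pageviews, max_multiple=10.0):
    """Cap each value at max_multiple times the median so one spike cannot flatten the chart."""
    median = statistics.median(daily_pageviews)
    cap = median * max_multiple
    return [min(value, cap) for value in daily_pageviews]


views = [1200, 1350, 1100, 2_000_000, 1280]
clipped = clip_outliers(views)
print(clipped)
```

The spike day is still visible as the highest bar, but the rest of the year is readable again.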

While GA is very user-friendly, I've found it's not as granular as some of the other stats programs (or maybe I'm not looking in the right places). Before the marketing monkeys I work with began pushing GA, we were very satisfied with AWStats. The sheer scope of the data helped us on several occasions hone sites to better suit their audience. While GA is very shiny and laid out well, I personally still prefer the raw numbers like I used to get through AWStats.

Slow data processing speed - Can be as low as 15-30 mins for page views, but may be up to 48 hours for eCommerce
EULA is limiting in some cases
You won't own or have any control of the data. Google's engineers might use it (anonymously) for testing
Anything more complex requires customization - Downloads and such are no issue, but there are limits
Cross domain tracking by linker is faulty at best
Visit based - Proper tools are based on Visitor level, GA works on Visit based reporting mostly
Limited number of custom vars used at one time (5)
No tech support, if you're realistic
Usually when there is a downtime notice, it's already gone
API limitations (4 dimensions and 10 metrics at one time, not all can be used together in addition to that)
I have many more, but at the end of the day it is a good tool for its price.

From a non-technical point of view, I think the most important reason is that some enterprises have strict data security policies: all of their data must be controlled and managed by the enterprise itself.
If you use Google Analytics, the data is stored on Google's servers. Some enterprises, such as insurance and financial companies, have to follow such a policy.

I would NOT go with server logs. In fact I have them disabled on my server. Why you ask me?
For the simple reason that every time you hit my server, that stupid logging program makes an entry in the physical log file on my HDD. So if my server gets 100,000 hits in a day, that's 100,000 times an HDD write operation happens.
You think that's cool? Well, it's not. It's slowing your server down, especially if the log file is huge.
Why would someone even consider doing that to their server? Especially when we're working so hard to minify JavaScript and CSS and make image files 2 KB smaller!
Please do yourself a favor: don't log directly on your server.
At least Google Analytics logs it on Google's server so my server's healthier.

I wouldn't use it for any of my sites, because you're forcing the user to accept your proprietary JavaScript code in their browser, which is bad. Also, giving your data to Google is a really bad idea.
See Piwik for something you can run yourself as free software, eliminating both of the problems.

Is there a reliable way to prevent cheating in a web based contest where anonymous users can vote?

I'm working on a web-based contest which is supposed to allow anonymous users to vote, but we want to prevent them from voting more than once. IP based limits can be bypassed with anonymous proxies, users can clear cookies, etc. It's possible to use a Silverlight application, which would have access to isolated storage, but users can still clear that.
I don't think it's possible to do this without some joker voting himself up with a bot or something. Got an idea?

The short answer is: no. The longer answer is: but you can make it arbitrarily difficult. What I would do:
Voting requires solving a captcha (to avoid as much as possible automated voting). To be even more effective I would recommend to have prepared multiple types of simple captchas (like "pick the photo with the cat", "what is 2+2", "type in the word", etc) and rotate them both by the time of the day and by IP, which should make automatic systems ineffective (ie if somebody using IP A creates a bot to solve the captcha, this will become useless the next day or if s/he distributes it onto other computers/uses proxies)
When filtering by IP you should be careful to consider situations where multiple hosts are behind one public IP (AFAIK AOL proxies all of their customers through a few IPs - so such a limitation would effectively ban AOL users). Also, many proxies send along headers pointing to the original IP (like X-Forwarded-For), so you can take a look at that too.
Finally, using something like FSO (Flash Shared Objects - "Flash cookies") is obscure enough for 99.99% of people not to know about. Silverlight is even more obscure. To be even sneakier, you could buy another domain and set the FSO from that domain (so, if the user is looking for FSOs set by your domain, they won't see any).
None of these methods is 100%, but hopefully combined they give you the level of assurance you need. If you want to take this a level higher, you need to add some kind of user registration (which can be as simple as asking for a valid e-mail address when the vote occurs, sending a confirmation link to the given address, and not counting the votes for which the link wasn't clicked - so it doesn't need to be a full-fledged "create an account with username / password / first name / last name / etc").
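Here is a small sketch of the IP part of the above: take the first address from X-Forwarded-For when it parses as an IP, fall back to the connection address otherwise, and allow more than one vote per IP so that users behind a shared proxy aren't locked out. The limit of 20 and the class name are assumptions for illustration, and keep in mind that X-Forwarded-For is client-supplied and can be forged.

```python
import ipaddress
from collections import Counter

MAX_VOTES_PER_IP = 20  # assumed limit; above 1 so shared proxies are not blocked outright


def client_ip(remote_addr: str, headers: dict) -> str:
    """Prefer the first X-Forwarded-For entry if it is a valid IP, else the socket address."""
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    try:
        return str(ipaddress.ip_address(first))
    except ValueError:
        return remote_addr


class VoteLimiter:
    """In-memory vote counter per client IP (a real contest would persist this)."""

    def __init__(self, limit: int = MAX_VOTES_PER_IP):
        self.limit = limit
        self.votes = Counter()

    def try_vote(self, remote_addr: str, headers: dict) -> bool:
        """Record the vote and return True, or return False once the IP hit its limit."""
        ip = client_ip(remote_addr, headers)
        if self.votes[ip] >= self.limit:
            return False
        self.votes[ip] += 1
        return True


limiter = VoteLimiter(limit=2)
proxy_headers = {"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}
print([limiter.try_vote("10.0.0.1", proxy_headers) for _ in range(3)])
```

The header lookup here is case-sensitive because it uses a plain dict; most web frameworks give you a case-insensitive header mapping instead.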

No, you can't, and it only takes one person and a willing forum to change the outcome of an online vote.
You have to realize the inherent flaws of an online vote and rather than attempting to get around them try to use them to your advantage.
-Adam

You can certainly make it difficult.
What about building a user profile with such things as IP address, browser user agent, machine name, and whatever other information you can get?
Store the profile for each user, then if you receive a profile which is similar enough to one already in the database (you'll have to tweak that) you can throw out that vote.
I imagine you can probably build a better profile using silverlight, though I'm not sure what information that gives you access to.
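A rough sketch of the "similar enough" check could look like the following. The attribute list and the 0.75 threshold are placeholders I picked for the example; as said above, you would have to tweak both.

```python
# Attributes compared between profiles (an assumed, illustrative selection).
PROFILE_FIELDS = ("ip", "user_agent", "language", "screen")


def similarity(a: dict, b: dict) -> float:
    """Fraction of profile fields that are present in a and equal in both profiles."""
    matches = sum(
        1 for name in PROFILE_FIELDS
        if a.get(name) is not None and a.get(name) == b.get(name)
    )
    return matches / len(PROFILE_FIELDS)


def is_duplicate(profile: dict, seen: list, threshold: float = 0.75) -> bool:
    """Treat the vote as a repeat if any stored profile is at least `threshold` similar."""
    return any(similarity(profile, old) >= threshold for old in seen)


first = {"ip": "203.0.113.7", "user_agent": "Firefox", "language": "en", "screen": "1920x1080"}
same_browser_new_ip = dict(first, ip="198.51.100.4")
other = {"ip": "192.0.2.1", "user_agent": "Safari", "language": "de", "screen": "1440x900"}

seen_profiles = [first]
print(is_duplicate(same_browser_new_ip, seen_profiles), is_duplicate(other, seen_profiles))
```

A lower threshold throws out more legitimate votes from people with common setups, and a higher one lets more repeat voters through, so there is no setting that is right for everyone.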

Client-side solutions are out for the reasons you listed -- they can be manipulated by the user. Server-side solutions -- as you said -- can be fooled and bypassed.
If you're willing to accept the fact that you can't really be 100% sure that you're getting exactly one vote per person, then there are some measures you can take to reduce the noise.
Use a CAPTCHA in your vote-submission form to make it harder for bots and scripts to vote.
Limit the number of votes per IP address to one.
Consider requiring registration in order to vote. (I know this defeats part of your original question, but it gives you a greater degree of control over the voting.)
That's a good start.

My personal experience in contest development and monitoring tells me that no, there is no reliable way to avoid cheating if you let anonymous users vote (or do anything that lets them participate in the contest).
You could play with IP or introduce delays between one action and the next, but it's really difficult: the best way is to introduce a captcha or something similar, if applicable in your particular situation.
Best of all, don't let anonymous users participate: let them "play" and access a simulation, but the contest needs a login.

Nope, it's the user's computer and they're in control.
Unfortunately, the only solution is to bring it back into your court, so to speak, and require authentication.
However, a CAPTCHA helps limit the votes to human users at least.
Of course even with authentication you can't enforce single voting because then they teach the bots to register...

I have to agree that the short answer is no...though if you look at my recent answer here: How to anonymously identify a user and store that information you certainly can get it within a 6 percent margin of error.
