Intent Data - How exactly are traceable urls used to track interest in b2b topics? - google-analytics

I've been doing some research on intent data and I have some technical questions, especially about how two businesses might be collecting "contact level" i.e. personally identified web traffic details without using third-party cookies.
Some quick background: Most of the large providers of intent data (bombora, the big willow/aberdeen/Spiceworks Ziff Davis, Tech Target etc.) offer "account" based intent data - essentially when users visit websites in their network, they do a reverse IP addresses lookup, match them to know IP addresses of large companies (usually companies with at least 250 employees) and note what topics are "surging" - aka showing unusual traffic on a given week. This largely makes sense to me. I'm assuming that when a visitor shows up at your site, google analytics and similar tools can tell you what google search keywords were used to arrive at your site, and that's how they can say things like - we can "observe intent signals across an unlimited number of contextual keyword categories, allowing you to customize your keywords and layer these insights onto your campaigns for optimal performance." Third party cookies, and data from DSP's (demand side platform's enabling ad buyers to buy ads across many platforms) are also involved in providing data, those these will be less useful sources of data after google sunset's third party cookies on Chrome.
Two providers - intentdata.io, and intentflow.com are offering contact level intent data. You can imagine why that would be of interest - if the director of sales is interested in your sales SaaS tool, you have a better idea of how qualified that lead is and who to reach out to. Only one of the two providers is specific about what exactly they're collecting - i.e. what "intent" they are capturing and how they're collecting it.
Intentdata.io:
Intentdata.io looks like a tiny company (two employees on LinkedIn). The most specific statement I've found about what their data is was in an Impact+ podcast interview - Ed, the CRO at intentdata.io, mentions that the data is analogous to commenting on a Forbes article or a conversation on LinkedIn. But he's clear - "that's just an analogy." They also say elsewhere that the data they provide mentions specifically what action the contact took that landed them in the provided data.
Ed from intentdata.io is also asked about GDPR compliance in his Impact+ interview - he basically says, some lawyers will disagree but he believes their data to be GDPR compliant, and it is in use by some firms in the EU. He does mention though that some firms have asked them to exclude certain columns from the data, like email addresses.
Edit: Found a bit more on intentdata.io - looks like they build a custom setup to pull "intent" data for each customer - they don't have a database monitoring company interaction with content across social media and b2b sites, instead you provide them with "lists (names and URLs) of customers, competitors, influencers, events, target accounts and key terms that would indicate intent at different stages in the buying journey. Pull together important hashtags, details on your ideal buyer (job titles, functions, seniority) and firmographics (size, industry, location)" - then they create a custom "algorithm" from this info, and they iterate on that "algorithm" a little bit over time.
They also make this statement on their site: "IntentData.io's data is collected from observing public actions that users are taking around the web. That means that first, we observe action (not reading, searching, browsing, being shown an ad, etc.) which we believe is a more concrete manifestation of intent. Second, people are taking these actions publicly for the world to see. We do not use any cookies, bidstream data or reverse IP lookups."
Finally one piece of their sales collateral asks: What ad budget do you have for PPC nurturing ads? So their may be some targeted PPC ads involved in the "algorithm."
Edit 2: Their sales collateral also states that they use "a third-party intent data methodology that uses multi-variable linear regression analysis to correlate observed actions with a specific contact. This is the method that the LeadSift engine of IntentData.io data uses."
Intentflow.com:
Intentflow.com seems like the sketchier of the two providers if I'm honest. They provide a video walkthrough of how they get their data at intentflow.com/thesis - but I'm not following how using "traceable urls" with no cookies involved, could give you contact level information. They also say they lookup what the most popular articles/pages are for 5k to 40k unique keywords or phrases that are related to 10-50 keywords or phrases you give them to target. And they use "traceable urls" to track who visits those sites. Again - no cookies involved. Supposedly fully compliant at least with US laws. They don't provide data for the EU "by design" so presumably they're not GDPR compliant? They also claim they can identify the individuals who are visiting your website, again using "traceable urls" - it seems clear from the pitch that you're asked to reach out to your backlink providers around the web to use this traceable url.
I've seen an interview where a rep from Bombora says they tried for a while to do contact level intent data and it wasn't very useful - and it wasn't really doable in a compliant way. Ed seems to be aware they've said that publicly, and he says "that's just not true."
So what's going on here? How exactly are these two small firms getting contact level intent data? Do you think they're doing it in a compliant way?

Got more information:
Intentdata.io use public comments, likes, shares etc. on blogs, social posts via web crawling and scraping for events, influencers, hashtags, articles etc. that the customer deems worth tracking. They do some work to try and connect the commenters with an identifiable contact. They bill on a quarterly basis for this.
Intentflow.com doesn't seem to use "traceable urls" at all. They take bidstream data, and identify the individual visitors via an "identity graph." They provide a minimum of 5k contacts per month at $2 per contact, making their data very expensive ($120k+ per year). You can't get lower than however many contacts their system spits out per month so it seems like there's not a good firm limit on what you will be charged. They say they can identify ~70% of web traffic, and they only provide data on US site visitors. Each row of their output would include not just the contact, but the site that contact was shown an ad on. Definitely interesting data - but I'm guessing they will be very affected by upcoming changes to third party cookies, privacy laws, etc.

Related

Using Google Tag Manager to track personalised data

I am considering using Google Tag Manager to track abandoned forms. I note that collecting data that "personally identifies an individual (such as a name, email address or billing information)" is against their Terms of Service.
However, I am whether Google actually enforce this policy, and if so, how?
P.s I not actually looking to implement this - merely curios whether and how it is actually enforced.
It's not enforced unless people write to Google. There are many tracking mistakes on various sites when PII data gets tracked and no one does anything about it for years, even decades. Google doesn't actively check the content of your data to find PII, but it will investigate and take action if people send complains.
Corp lawyers and tracking specialists are usually extremely careful around gathering PII data without explicit consent. Well, maybe gigantic corps would be an exception since they know exactly how to monetize PII.
Anyhow, the real issue here is that you rarely (if ever) want to know what they type for analytics reasons. You want to know what fields they interacted with, how many times they tried to submit the form, what kind of errors they've got before abandoning. That allows for plenty analysis aimed at the form performance improvement.

Finding the number of common users between two websites

There are two Swiss (.ch) websites, let's call them A and B. A is owned by me and B by a customer.
Because of legal data protection issues B is hosted in Switzerland and not allowed to store any user information abroad. Which means that software like Google Analytics is not available on B. A is a Swiss website but hosted in a (European) cloud.
Now we would like to find out how many common users we both have over the duration of 30 days. In short:
numberOfUsersA ∩ numberOfUsersB
For the sake of simplicity: Instead of users we are perfectly happy to measure common browsers.
What would you suggest is the simplest way to solve this problem?
First off all, best regards from Zurich/Zug :) Swiss people are everywhere...
I don't think you're correct that it's not legal to collect data in Switzerland at all (also abroad). As I'm working in the financial industry I know this topic very well and we also had to do a lot research to use GA at all.
It's always the question what and how you collect data. What you can't do - beside you got in upfront the permission of the user - is storing personal identifiable information. That's anyway not allowed by GA - you can't import/save in custom dimension/metrics for example email addresses.
Please check https://support.google.com/adsense/answer/6156630?hl=en as general basic information about this topic.
If you save the IP addresses via IP anonymization, you shouldn't run into problems if you're declaring this in your data-privacy statements. Take this approach: https://support.google.com/analytics/answer/2763052?hl=en
I'm not a lawyer and also not want to give you legal advises, but ours told us that's fine. If you are real paranoid about sending data to the USA - like we have to be - you can exclude your tracking from very sensitive forms.
To go back to your basic question, if you want to find this out via Google Analytics, your key is "cross domain tracking". Check https://support.google.com/analytics/answer/1034342?hl=en for more information in this direction.
The only work-around I have in my mind beside this, is if you start collecting browser-fingerprints yourself and then connect both collections over the finger prints together (that's not save, as your visitors will use more than one device/configuration). I personally would go for the IP anonimization, exclude very sensitive forms and ensure that your data-privacy declaration contains all necessary parts for and offer an opt-out option then you should be on the safe side.
All the best and TGIF :)

Travel APIs how to integrate them all?

I may start working on a project very similar to Hipmunk.com, where it pulls the hotel cost information by calling different APIs (like expedia, orbitz, travelocity, hotels.com etc)
I did some research on this, but I am not able to find any unique hotel id or any field to match the hotels between several API's. Anyone have experience on how can to compare the hotel from expedia with orbitz or travelcity etc?
Thanks
EDIT: Google also doing the same thing http://www.google.com/hotelfinder/
From what I have seen of GDS systems, and these API's there is rarely a unique identifier between systems for e.g. hotels
Airports, airlines and countries have unique ISO identifiers: http://www.iso-code.com/airports.2.html
I would guess you are going to have to have your own internal mapping to identify and disambiguate the properties.
:|
When you get started with hotel APIs, the choice of free ones isn't really that big, see e.g. here for an overview.
The most extensive and accessible one is Expedia's EAN http://developer.ean.com/ which includes Sabre and Venere with unique IDs but still each structured differently.
That is, you are looking into different database tables.
You do get several identifies such as Name, Address, and coordinates, which can serve for unique identification, assuming they are free of errors. Which is an assumption.

How to determine demographics of users visiting your site?

Ad-Servers seem (and do) know a lot about the use who is visiting a certain webpage leveraging Behavioral and Contextual Targeting. I would love to be able to keep track of that data as well. In particular I would like to know:
age range
male/female
geographical info
I would like this information on a per request basis (not a daily summary)
What is the best way to accomplish this?
Thanks!
There are vendors who specialize in characterizing your Site's traffic. Very roughly they work by finding the closest match to your Site from among a large population of Sites in which they do in fact have detailed demographic data. To improve the matching, some of them give you a javascript snippet to insert into your Site's pages to collect user data and send it to their servers (more or less like web analytics code).
Quantcast is such vendor. The link i included will take you to their page that displays sample audience demographic reports.
Crowd Science is another.
Neither of these are free (though they might have a freemium service, i don't know.
Alexa, on the other hand, is free and offers similar data; just enter your Site's url in their textbox, then when you get the results page, select the Audience tab.
Age and Gender: Ask your users.
Geographical Info: Use GeoIP targeting.
You can try Hitwise, but it's a little on the pricey side IIRC
Doug's is a good answer, but Google Analytics now gives you this too, based on their acquisition of DoubleClick. So it's free.
Google Analytics Demographics & Interests
Note that no matter who you get this information from, the information is based on cross-site information. This is based on "third party cookies" which many users turn off (sometimes without knowing they are doing this) depending on their browser's security/privacy settings.

Basic site analytics doesn't tally with Google data

After being stumped by an earlier quesiton: SO google-analytics-domain-data-without-filtering
I've been experimenting with a very basic analytics system of my own.
MySQL table:
hit_id, subsite_id, timestamp, ip, url
The subsite_id let's me drill down to a folder (as explained in the previous question).
I can now get the following metrics:
Page Views - Grouped by subsite_id and date
Unique Page Views - Grouped by subsite_id, date, url, IP (not nesecarily how Google does it!)
The usual "most visited page", "likely time to visit" etc etc.
I've now compared my data to that in Google Analytics and found that Google has lower values each metric. Ie, my own setup is counting more hits than Google.
So I've started discounting IP's from various web crawlers, Google, Yahoo & Dotbot so far.
Short Questions:
Is it worth me collating a list of
all major crawlers to discount, is
any list likely to change regularly?
Are there any other obvious filters
that Google will be applying to GA
data?
What other data would you
collect that might be of use further
down the line?
What variables does
Google use to work out entrance
search keywords to a site?
The data is only going to used internally for our own "subsite ranking system", but I would like to show my users some basic data (page views, most popular pages etc) for their reference.
Lots of people block Google Analytics for privacy reasons.
Under-reporting by the client-side rig versus server-side eems to be the usual outcome of these comparisons.
Here's how i've tried to reconcile the disparity when i've come across these studies:
Data Sources recorded in server-side collection but not client-side:
hits from
mobile devices that don't support javascript (this is probably a
significant source of disparity
between the two collection
techniques--e.g., Jan 07 comScore
study showed that 19% of UK
Internet Users access the Internet
from a mobile device)
hits from spiders, bots (which you
mentioned already)
Data Sources/Events that server-side collection tends to record with greater fidelity (much less false negatives) compared with javascript page tags:
hits from users behind firewalls,
particularly corporate
firewalls--firewalls block page tag,
plus some are configured to
reject/delete cookies.
hits from users who have disabled
javascript in their browsers--five
percent, according to the W3C
Data
hits from users who exit the page
before it loads. Again, this is a
larger source of disparity than you
might think. The most
frequently-cited study to
support this was conducted by Stone
Temple Consulting, which showed that
the difference in unique visitor
traffic between two identical sites
configured with the same web
analytics system, but which differed
only in that the js tracking code was
placed at the bottom of the pages
in one site, and at the top of
the pages in the other--was 4.3%
FWIW, here's the scheme i use to remove/identify spiders, bots, etc.:
monitor requests for our
robots.txt file: then of course filter all other requests from same
IP address + user agent (not all
spiders will request robots.txt of
course, but with miniscule error,
any request for this resource is
probably a bot.
compare user agent and ip addresses
against published lists: iab.net and
user-agents.org publish the two
lists that seem to be the most
widely used for this purpose
pattern analysis: nothing sophisticated here;
we look at (i) page views as a
function of time (i.e., clicking a
lot of links with 200 msec on each
page is probative); (ii) the path by
which the 'user' traverses out Site,
is it systematic and complete or
nearly so (like following a
back-tracking algorithm); and (iii)
precisely-timed visits (e.g., 3 am
each day).
Biggest reasons are users have to have JavaScript enabled and load the entire page as the code is often in the footer. Awstars, other serverside solutions like yours will get everything. Plus, analytics does a real good job identifying bots and scrapers.

Resources