Is it possible to use data stored in cookies or from a web browser to determine if a website's visitor is male or female?
I'd like to use this sort of data to style my site with a different color scheme for males.
It's impossible to know a user's gender for a fact (they could lie; I know many people who often do during registration to maintain anonymity), but you can estimate it based on browser history.
One person created a little hack where you can estimate gender by browser history:
http://www.mikeonads.com/2008/07/13/using-your-browser-url-history-estimate-gender/
QUOTE:
... modified the SocialHistory JS so that it polled the browser to find out which of the Quantcast top 10k sites were visited. I then apply the ratio of male to female users for each site and with some basic math determine a guestimate of your gender. The math is really quite simple, I just take:
1 / (1 + r_1 * r_2 * … * r_n)
where r_i is the ratio of men-to-women for the specific site. For example, if you had been to two sites that had a 2:1 ratio of men to women, the probability of you being female would be:
1 / (1 + 2 * 2) = 1/5 = 20%
If you had access to users' browser history you may be able to derive some method to estimate their gender.
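For illustration, here is a minimal sketch of that estimate in C#; the ratios below are made-up placeholders rather than real Quantcast figures:
using System;
using System.Collections.Generic;
using System.Linq;

class GenderEstimate
{
    // Estimate P(female) from the male-to-female ratios of the sites found in the
    // visitor's history, using the 1 / (1 + r_1 * r_2 * ... * r_n) formula quoted above.
    static double ProbabilityFemale(IEnumerable<double> maleToFemaleRatios)
    {
        double product = maleToFemaleRatios.Aggregate(1.0, (acc, r) => acc * r);
        return 1.0 / (1.0 + product);
    }

    static void Main()
    {
        // Hypothetical history: two sites, each with a 2:1 male-to-female ratio.
        var ratios = new List<double> { 2.0, 2.0 };
        Console.WriteLine(ProbabilityFemale(ratios)); // 0.2, i.e. a 20% chance of being female
    }
}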
Google Analytics also estimates gender, as do Quantcast and others, along with many cPanel statistics packages like AWStats.
Users sometimes store their gender in local profiles on their PCs, and it may or may not end up in browser cookies set by sites that collected it during registration, but without illegal spyware to sniff users there is no clean, legal way to gather that information.
Of course, the very illegal way to be certain would be to access the webcam with malware and run facial recognition software to automatically determine the gender of every visitor who has a laptop or PC with a webcam. Unless you're some crazy hacker with no worries about getting arrested, I would avoid trying this if I were you.
Related
I've been doing some research on intent data and I have some technical questions, especially about how two businesses might be collecting "contact level" (i.e., personally identified) web traffic details without using third-party cookies.
Some quick background: most of the large providers of intent data (Bombora, The Big Willow/Aberdeen/Spiceworks Ziff Davis, TechTarget, etc.) offer "account"-based intent data. Essentially, when users visit websites in their network, they do a reverse IP address lookup, match the addresses to known IP ranges of large companies (usually companies with at least 250 employees), and note which topics are "surging", i.e. showing unusual traffic in a given week. This largely makes sense to me.
I'm assuming that when a visitor shows up at your site, Google Analytics and similar tools can tell you which Google search keywords were used to arrive at your site, and that's how they can say things like: we can "observe intent signals across an unlimited number of contextual keyword categories, allowing you to customize your keywords and layer these insights onto your campaigns for optimal performance." Third-party cookies, and data from DSPs (demand-side platforms enabling ad buyers to buy ads across many platforms), are also involved in providing data, though these will be less useful sources after Google sunsets third-party cookies in Chrome.
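Just to make the account-level mechanics concrete for myself, here is a rough sketch (my own illustration, not any vendor's actual implementation) of matching a visitor IP against a table of known company IP ranges; the ranges and company names are placeholder documentation values:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

class ReverseIpAccountMatch
{
    // Invented example data: CIDR range -> company name. Real providers would
    // maintain a far larger, licensed database of company IP allocations.
    static readonly List<(string Cidr, string Company)> KnownRanges = new List<(string Cidr, string Company)>
    {
        ("203.0.113.0/24", "ExampleCorp"),
        ("198.51.100.0/24", "AcmeWidgets"),
    };

    // IPv4 only, for simplicity.
    static uint ToUint(IPAddress ip)
    {
        var bytes = ip.GetAddressBytes(); // big-endian
        return ((uint)bytes[0] << 24) | ((uint)bytes[1] << 16) | ((uint)bytes[2] << 8) | bytes[3];
    }

    static bool InCidr(IPAddress ip, string cidr)
    {
        var parts = cidr.Split('/');
        uint network = ToUint(IPAddress.Parse(parts[0]));
        int prefix = int.Parse(parts[1]);
        uint mask = prefix == 0 ? 0u : uint.MaxValue << (32 - prefix);
        return (ToUint(ip) & mask) == (network & mask);
    }

    static string MatchCompany(string visitorIp)
    {
        var ip = IPAddress.Parse(visitorIp);
        return KnownRanges.FirstOrDefault(r => InCidr(ip, r.Cidr)).Company ?? "unknown";
    }

    static void Main()
    {
        Console.WriteLine(MatchCompany("203.0.113.42")); // ExampleCorp
        Console.WriteLine(MatchCompany("192.0.2.1"));    // unknown
    }
}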
Two providers - intentdata.io and intentflow.com - are offering contact-level intent data. You can imagine why that would be of interest: if the director of sales is interested in your sales SaaS tool, you have a better idea of how qualified that lead is and who to reach out to. Only one of the two providers is specific about what exactly they're collecting - i.e. what "intent" they are capturing and how they're collecting it.
Intentdata.io:
Intentdata.io looks like a tiny company (two employees on LinkedIn). The most specific statement I've found about their data was in an Impact+ podcast interview: Ed, the CRO at intentdata.io, mentions that the data is analogous to commenting on a Forbes article or a conversation on LinkedIn. But he's clear - "that's just an analogy." They also say elsewhere that the data they provide specifies exactly what action the contact took that landed them in the provided data.
Ed from intentdata.io is also asked about GDPR compliance in his Impact+ interview - he basically says, some lawyers will disagree but he believes their data to be GDPR compliant, and it is in use by some firms in the EU. He does mention though that some firms have asked them to exclude certain columns from the data, like email addresses.
Edit: Found a bit more on intentdata.io - it looks like they build a custom setup to pull "intent" data for each customer. They don't have a database monitoring companies' interactions with content across social media and B2B sites; instead you provide them with "lists (names and URLs) of customers, competitors, influencers, events, target accounts and key terms that would indicate intent at different stages in the buying journey. Pull together important hashtags, details on your ideal buyer (job titles, functions, seniority) and firmographics (size, industry, location)" - then they create a custom "algorithm" from this info, and they iterate on that "algorithm" a little over time.
They also make this statement on their site: "IntentData.io's data is collected from observing public actions that users are taking around the web. That means that first, we observe action (not reading, searching, browsing, being shown an ad, etc.) which we believe is a more concrete manifestation of intent. Second, people are taking these actions publicly for the world to see. We do not use any cookies, bidstream data or reverse IP lookups."
Finally, one piece of their sales collateral asks: "What ad budget do you have for PPC nurturing ads?" So there may be some targeted PPC ads involved in the "algorithm."
Edit 2: Their sales collateral also states that they use "a third-party intent data methodology that uses multi-variable linear regression analysis to correlate observed actions with a specific contact. This is the method that the LeadSift engine of IntentData.io data uses."
Intentflow.com:
Intentflow.com seems like the sketchier of the two providers, if I'm honest. They provide a video walkthrough of how they get their data at intentflow.com/thesis - but I'm not following how "traceable URLs", with no cookies involved, could give you contact-level information. They also say they look up the most popular articles/pages for 5k to 40k unique keywords or phrases related to the 10-50 keywords or phrases you give them to target, and they use "traceable URLs" to track who visits those sites. Again - no cookies involved. Supposedly fully compliant, at least with US laws. They don't provide data for the EU "by design", so presumably they're not GDPR compliant? They also claim they can identify the individuals who are visiting your website, again using "traceable URLs" - it seems clear from the pitch that you're asked to reach out to your backlink providers around the web to use these traceable URLs.
I've seen an interview where a rep from Bombora says they tried for a while to do contact level intent data and it wasn't very useful - and it wasn't really doable in a compliant way. Ed seems to be aware they've said that publicly, and he says "that's just not true."
So what's going on here? How exactly are these two small firms getting contact level intent data? Do you think they're doing it in a compliant way?
Got more information:
Intentdata.io uses public comments, likes, shares, etc. on blogs and social posts, gathered via web crawling and scraping of the events, influencers, hashtags, articles, etc. that the customer deems worth tracking. They do some work to try to connect the commenters with an identifiable contact. They bill on a quarterly basis for this.
Intentflow.com doesn't seem to use "traceable urls" at all. They take bidstream data, and identify the individual visitors via an "identity graph." They provide a minimum of 5k contacts per month at $2 per contact, making their data very expensive ($120k+ per year). You can't get lower than however many contacts their system spits out per month so it seems like there's not a good firm limit on what you will be charged. They say they can identify ~70% of web traffic, and they only provide data on US site visitors. Each row of their output would include not just the contact, but the site that contact was shown an ad on. Definitely interesting data - but I'm guessing they will be very affected by upcoming changes to third party cookies, privacy laws, etc.
How would I go about finding out the 'current' MNC for a UK mobile phone number?
I have given out a collection of numbers to companies, and they returned the "original MCC/MNC" & the "current MCC/MNC" codes, all checked out fine.
I would like to know how this was done in the first place. It's easy to find the original MCC/MNC codes, but I'm having trouble with the current MCC/MNC.
To obtain the current MCC / MNC (or NWC - Mobile Network Code) for a mobile number you can take a number of approaches. Based on your comment, I'm also going to include a little background information.
Getting the original MCC / MNC
This is relatively easy as long as you have a reliable source of data. The MCC (Mobile Country Code, designated by ITU-T) is relatively simple to deal with, as MCCs don't change all that often. MNCs are a little more tricky because they can change over time. The ITU-T also distributes these allocations and regularly publishes updates - or should I say the GSMA does.
Getting current MCC / MNC
Here you have a number of factors to consider. One of them you have already mentioned. Here are some more possibilities:
Mobile porting - Transfer of a mobile phone number from one operator to another
Roaming - The mobile phone number is currently registered on a "foreign" mobile network
Both of these factors mean that just using the mobile phone number is not an option for finding out the current MCC / MNC. It is really a question of how accurate you need the information to be. And of course how much money you want to spend finding it out.
So finally to the original question. The short answer is no, you do not have to be a member of the ITU to have access to this information. The long answer is that you need access either to the ITU publications or to one of the sources below. As I recall, the following are ways of obtaining the information you need:
The GSMA (GSM Association) regularly publishes updates to NWCs (Mobile Network Codes) in document form, together with numbering schemes for every country using GSM networks.
Neustar (http://www.neustar.biz) provides an API which you can query for the currently registered (non-roaming) mobile phone numbers. They also provide portability information which is updated at various rates depending on country and operator. Effectively they are the root of all portability information for the GSMA.
Some mobile operators, for example Deutsche Telekom in Germany, provide an API to obtain daily-updated portability information for the whole of Germany.
Companies with SS7 connectivity (basically the GSM cloud where mobile operators interoperate) can query a mobile phone number's current network registration in real time. This also includes whether the mobile phone number is roaming or not.
This information is priceless for many companies, and the GSMA rightly ensures that only companies and people who can responsibly manage this information are allowed to obtain it.
You can use this nuget package.
Sample code:
var IsViablePhoneNumber = PhoneNumberUtil.IsViablePhoneNumber("989123456789"); // basic validity check on the number string
var MCC_MNC = PhoneNumberUtil.GetMCCMNC("989123456789");                       // look up the MCC/MNC for the number
var Operator = PhoneNumberUtil.GetOperator("989123456789");
var Brand = PhoneNumberUtil.GetBrand("989123456789");
var OperatorStatus = PhoneNumberUtil.GetOperatorStatus(232, 10);               // these two take an MCC and MNC pair
var OperatorType = PhoneNumberUtil.GetOperatorType(232, 10);
Is there a proper way, equation or technique in general to say, "My web application needs to support N number of total users which via this equation/technique/rockHardExperience tells me that I need to support X number of concurrent page requests"?
From my research and/or gut feeling it seems like it would be something like:
totalLoadCapabilityRequired = (totalUsersN * 0.10) * 0.5
where 0.10 assumes roughly 10% of users are online at any given time,
and the whole thing is multiplied by 0.5 to suggest a 50% chance of those online users executing a request at roughly the same time.
Any insights would help me make sure I implement support in my application that is on par with the demand. I expect a lot of users but don't want to over-anticipate too early. For starters, I know that the org I am programming for will have 45,000 users they want on my system, with the anticipation that success will bring many more.
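For reference, here is that guesstimate worked through with the 45,000 users mentioned above; the 10% online and 50% concurrency factors are just the assumed inputs from the formula, not measured values:
using System;

class CapacityGuess
{
    static void Main()
    {
        int totalUsers = 45000;          // users the org expects on the system
        double onlineFraction = 0.10;    // assume ~10% online at any given time
        double concurrencyFactor = 0.5;  // assume 50% of those fire a request at roughly the same moment

        double concurrentRequests = totalUsers * onlineFraction * concurrencyFactor;
        Console.WriteLine(concurrentRequests); // 2250 concurrent page requests, as a ballpark
    }
}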
Here are a couple of things to think about:
What's the time span in which you expect the bulk of your visits? If it's an office application within the same physical company your capacity planning should be based on an 8 hour period. If most visits will come from the same continent you can plan for a 12 hour period instead, etc. Base your visitor spread on that.
Which pages do you anticipate will be the most popular and how heavy are those pages (i.e. how many pages can you load in one second)? Get an understanding of parts that would benefit from caching to squeeze out more performance.
Don't plan based on peak load; design your app to scale and start small.
Design your app in a way that you can take runtime snapshots at every 500th request (see the sampling sketch below); you can use tools like xhprof to create files that you can run through cachegrind tools to analyze the performance as it runs.
In short, there's no catch-all formula :) For a ballpark figure your formula will probably be good enough, but take the above points into consideration.
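To make the "snapshot every 500th request" suggestion concrete, here is a minimal sampling gate. This is only a sketch of the counting logic (xhprof itself is PHP-specific); the profiler hook is left as a comment:
using System;
using System.Threading;

class RequestSampler
{
    private static long _requestCount;
    private const int SampleEvery = 500; // profile one request out of every 500

    // Call this at the start of each request; returns true when this request should be profiled.
    public static bool ShouldProfile()
    {
        long n = Interlocked.Increment(ref _requestCount);
        return n % SampleEvery == 0;
    }

    static void Main()
    {
        // Simulate 1500 requests; requests 500, 1000 and 1500 get profiled.
        for (int i = 0; i < 1500; i++)
        {
            if (ShouldProfile())
            {
                Console.WriteLine($"Profiling request #{_requestCount}");
                // ... start the profiler here and dump a snapshot when the request finishes ...
            }
        }
    }
}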
How are services like Alexa and Google Analytics capable of tracking visitors' age, gender, college education, and so forth?
http://www.alexa.com/siteinfo/stackoverflow.com
Alexa definitely gets its traffic info from its toolbar users. Since that is a relatively small and self-selecting group of people, this inevitably leads to a biased sample (which is why Alexa traffic doesn't match measured traffic on the sites I run). Even with the best statistical techniques for reducing bias, you can never get rid of it entirely when the sampling distribution is not uniform.
It's unclear how Google does it, although it might involve tracking cookies.
A project I have been working on recently has bearing on this question.
Another way to do this (that also has biases, but different ones) would be to use an IP to location service to find the approximate latitude and longitude of each visitor to your site. Then use my project (full disclosure: I run that site and it is commercial):
http://askgeo.com
to get demographic information for that location. AskGeo actually provides demographic information on several geographic levels (state, county, county subdivision, city, ZIP code, census tract (a few thousand people), and census block group (about a thousand people)). You'd presumably want to use the lowest level (i.e., census block group) for a given latitude and longitude.
The site returns a huge number of demographic variables. The idea would be to use soft counts from the demographic variables provided on the block group level. To take an example, if you are trying to track the age distribution of your users, then you'd use the age ranges provided in the AskGeo response and for a given sample, you'd add a fractional soft count to each range that corresponds to the percentage of the population in that block group from the corresponding age range. For example, take my neighborhood in San Francisco. It has the following age distribution:
CensusAgePercent0To4: 7.3%
CensusAgePercent5To9: 3.5%
CensusAgePercent10To14: 3.2%
... (skipping a bit, as you probably get the idea) ...
CensusAgePercentOver85: 1.5%
If you got an IP address that you tracked to that census block group, you'd add each of those percentages (as a fraction from 0 to 1) to your (soft) counters for those age ranges. (A soft counter is just a counter that allows for non-integer counts.)
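Here is a small sketch of that soft-count idea. The bucket names and percentages are taken from the example block group above, but they are only illustrative and not necessarily the exact AskGeo response keys:
using System;
using System.Collections.Generic;

class SoftCounts
{
    // Accumulated (fractional) visitor counts per age bucket.
    static readonly Dictionary<string, double> AgeCounters = new Dictionary<string, double>();

    // Add one visitor by distributing a single count across the buckets
    // according to the age distribution of that visitor's census block group.
    static void AddVisitor(Dictionary<string, double> blockGroupAgeDistribution)
    {
        foreach (var bucket in blockGroupAgeDistribution)
        {
            AgeCounters.TryGetValue(bucket.Key, out double current);
            AgeCounters[bucket.Key] = current + bucket.Value; // bucket.Value is a fraction from 0 to 1
        }
    }

    static void Main()
    {
        // Distribution for the example block group above (percentages as fractions).
        var exampleBlockGroup = new Dictionary<string, double>
        {
            ["0To4"] = 0.073,
            ["5To9"] = 0.035,
            ["10To14"] = 0.032,
            // ... remaining buckets ...
            ["Over85"] = 0.015,
        };

        AddVisitor(exampleBlockGroup);
        AddVisitor(exampleBlockGroup); // a second visitor traced to the same block group

        foreach (var kv in AgeCounters)
            Console.WriteLine($"{kv.Key}: {kv.Value:F3}");
    }
}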
You could do the same with race, gender, income level, house values, etc.
This method also has biases, for sure, since it assumes that all the people in a given block group are equally likely to visit your site. But it is something that you can do on your own site, not just Google and Alexa, and it would still give you a relative sense of who is visiting your site if your soft counts in a given category are higher than the national average in that category.
It is also possible that a more sophisticated technique than simple direct counts could lead to a much richer result.
I did some research, and apparently these demographics are tracked the same way TV audience demographics are tracked. There are people who browse with their (Alexa's) toolbars, which keep track of the sites visited. These people willingly (?) supply information like age, gender, etc., and Alexa extrapolates the general demographics from this sample. This of course leaves room for bias, but that's a problem with statistics.
Alexa gets its information from browser toolbars that you install on purpose or as part of a bundle with some software.
It asks questions to collect demographic parameters and also tracks the sites that you visit. If you know that 80% of a site's visitors are women and a new visitor arrives at that site, you can assume there is a high probability that this person is a woman. If you know a lot of the sites this person visits, you can guess a lot.
But as http://netberry.co.uk/alexa-rank-explained.htm explains, you can only rely on information for the Alexa top 100,000 sites, because only then does Alexa have enough information from the relatively small number of users visiting these sites. They say "millions" of users, but it's a small share of the total.
I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the top of the charts (just because they're so bad).
If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What's your guys' thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per se.
You seem to be missing the point that likes and dislikes of movies are anything but objective, even within the context of a relatively homogeneous group of "voters". Think of how the term "Chix Flix" or the success story called "NetFlix" illustrates this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem of dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie, or because they neither truly liked nor disliked it? Very likely a bit of both; therefore we can/should use the count of "page views without a vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies appear more notorious or popular).
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts are visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (e.g. "likes" / "total" or "likes" / "dislikes", etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereas the number of votes (and of views) is indicative of the notoriety ("name recognition", etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce a lot of volatility into the rating. Phrased otherwise, small samples make for not-so-statistically-representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/view happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but also may be used to direct the users towards currently hot items. BTW, hence feeding the bandwagon effect mentioned :-( but also, increasing the voting sample size :-).
All these considerations suggest caution in implementing this rating system. They also hint at the likely need to include statistics about the complete set of movies in the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of its own vote/view counts, but also on, say, the average vote count a movie receives, the maximum views a movie page gets, etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated using the statistics of groups of similarly rated movies, may provide a better system (provided the formulas are "fair" and somehow converge).
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes, which gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of baseline values will influence the rating of a new movie (the two values don't have to be equal) and how many votes are needed to change the rating substantially.
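A minimal sketch of that baseline trick, using the like/dislike ratio scoring from the question and the 10/10 baseline as an example figure (tune both to taste):
using System;

class RatingWithBaseline
{
    const double BaselineLikes = 10.0;
    const double BaselineDislikes = 10.0;

    // Like/dislike ratio with a neutral prior: with no votes at all the score is 1.0,
    // and a handful of early votes can't swing it wildly.
    static double Score(int likes, int dislikes)
    {
        return (likes + BaselineLikes) / (dislikes + BaselineDislikes);
    }

    static void Main()
    {
        Console.WriteLine(Score(0, 0));    // 1.0   (no votes yet)
        Console.WriteLine(Score(3, 0));    // 1.3   (a few early likes move the score only slightly)
        Console.WriteLine(Score(100, 50)); // ~1.83 (votes eventually overwhelm the baseline)
    }
}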