What is an inclusive way of asking for and storing a user's gender and sexual orientation? - user-profile

I need to ask for users' genders and sexual orientations, in a normalized way that allows for filtering users on both properties (so I can't just store user-provider strings). What's the most inclusive way to do this?
For genders, ISO_5218 recommends having the options: 0 (not known), 1 (male), 2 (female) and 9 (not applicable); but this groups a lot of people who identify in many distinct ways in the same bucket. For sexual orientations I haven't found a standard.
Right now I'm thinking of having a set of genders and a set of sexual orientations that users can pick from, as well as having the option to provide a custom string for both. But that's not great for filtering (custom strings would not fall into any filters), and I don't know what genders and sexual orientations to include in the list.
Are there any inclusive established practices for this?

Related

modeling scenario with mostly semi-additive facts

Im learning dimensional modeling and Im trying to create a model. I was thinking about a social media platform which rates hotels. The platform has following data:
hotel information: name and address
a user can rate hotels (1-5 points)
a user can write comments
platform stores the date of the comments
hotel can answer via comment and it stores the date of it
the platform stores the total number of each rating level (i.e.: all rates with 1 point, all rates with 2 point etc.)
platform stores information of the user: sex, name, total number of votes he/she made and address
First, I tried to define which information belongs to a dimension or fact table
(here I also checked which one is additive/semi additive/non-additive)
I realized my example is kind of difficult, because it’s hard to decide if it belongs to a fact table or dimension.
I would like to hear some advice. Would someone agree with my model?
This is how I would model it:
Hotel information -> hotel dimension
User rating -> additive fact – because I can aggregate them with all dimensions
User comment -> semi additive? – because I can aggregate them with the date dimension (I don’t know if my argument is correct, but I know I would have new comments every day, which is for me a reason to store it in a fact table
Answer as comment -> same handling like with the user comments
Date of comment-> dimension
Total Number of all votes (1/2/3/4/5) -> semi-additive facts – makes no sense to aggregate them, since its already total but I would get the average
User information sex and name, address -> user-dimension
User Information: total number of votes -> could be dimension or fact. It depends how often it changes. If it changes often, I store it in a fact. If its not that often, then dimension
I still have question, hope someone can help me:
My Question: should I create two date dimensions, or can I store both information in one date dimension?
2nd Question: each user and hotel just have one address. Are there arguments, to separate the address dimension in a own hierarchy? Can I create a 1:1 relationship to a user dimension and address dimension?
For your model, it looks well considered, but here are some thoughts:
User comment (and answers to comments): they are an event to be captured (with new ones each day, as you mention) so are factual, with dimensionality of the commenter, type of comment, date, and the measure is at least a 'count' which is additive. But you don't want to store big text in a fact so you would put that in a dimension by itself which is 1:1 with the fact, for situations where you need to query on the comment itself.
Total Number of all votes (1/2/3/4/5) are, as you say, already aggregates, mostly for performance. Totals should be easy from the raw data itself so maybe not worthwhile to store them at all. You might also consider updating the hotel dimension with columns (hotel A has 5 '1' votes and 4 '2' votes) that you'd update as you go on, for easy filtering and categorisation.
User Information: total number of votes: it is factual information about a user (dimension) and it depends on whether you always just want to 'find it out' about a person or whether you are likely to use it to filter other information (i.e. show me all reviews for users who have made 10-20 votes). In that case you might store the total in the user dimension (and/or a banding, like 'number of reviews range' with 10-20, 20-30). You can update dimensions often if you need to, but you're right, it could still just live as a fact only.
As for date dimensions, if the 'grain' is 'day' then you only need one dimension, that you refer to from multiple facts.
As for addresses, you're right that there are arguments on both sides! Many people separate addresses into their own dimension, referred to from the other dimensions that use them. Kimball suggests you can do that behind the scenes if necessary, but prefers for each dimension to have its own set of address columns(but modelled as consistently as possible).

Recommendation systems - converting transaction counts to star ratings

I'm doing some exploratory work on recommendation systems and have been reading about collaborative filtering techniques involving user-based, item-based, and SVD algorithms. I am also trying out R's recommenderlab package.
One apparent assumption in the literature is that the user data has labelled items based on a rating scale, e.g. between 1 and 5 stars. I'm looking at problems where the user data does not have ratings but rather just transactions. For example, if I want to recommend restaurants to a user, the only data I have is how often he has visited other restaurants.
How can I convert these "transaction" counts into ratings that can be used by recommendation algorithms that expect a fixed-scale rating? One approach I thought of is simple binning:
0 stars = 0-1 visits
1 star = 2-3 visits
...
5 stars = 10+ visits
However, that doesn't seem like it would work well. For example, if someone visited a restaurant only once, he may still really love it.
Any help would be appreciated.
I would try different approaches. As you said, only visited once may indicate that the user still loves the restaurant but you don't know for sure. Your goal is not to optimize for one single user rather for all users. So for this, you can split your data into training and test data. Train on the training data with different scales and test on the test data.
The different scales may be
a binary scale (0:never visited, 1: visited). This is mostly used in online shops (bought or not). Would support your assuption with the one time visit.
your presented scale or other ranges for the 5 stars. You can also use more than 5 stars. I would potentially not group 0-1 visits.
The approach with the best accuracy should be chosen.
Here's an idea: restaurants the user has visited zero or one times tell you nothing about what they like. Restaurants they have visited many times tell you lots. Why not just look for restaurants similar to those the customer most regularly frequents? In this way, you're using positive information (what they like) but none of the negative since you don't have access to it anyway.
If you absolutely had to infer some continuous measure, I think it would only be sensible to look at the propensity for another visit given past behaviour. This would start with the prior probability of choosing that restaurant (background frequency, or just uniform over restaurants) with a likelihood term related to the number of visits to that restaurant. In this way the more a user visits a restaurant the more likely they are to visit again.

Game story where a lower score is better

My company is producing a racing game where the best score is the fastest time. Facebook publishes the time as a regular point score, where a higher score is better. This of course is turning it all upside down.
Is there a way to control how a game's score shown in a story? Ideally we would like to show "seconds" instead of points as well.
No, the Scores API currently only supports 'higher is better' for scores.
If you can't rework your scoring scheme to take this into account, consider using Open Graph actions instead - you can have the aggregations which appear on a user's Timeline ordered by whichever field of the object and action you need them to be ordered by,

How do Alexa and Google Analytics track demographics?

How are services like Alexa and Google Analytics capable of tracking visitors' age, gender, college education, and so forth?
http://www.alexa.com/siteinfo/stackoverflow.com
Alexa definitely gets its traffic info from its toolbar users. Since that is a relatively small and self-selecting group of people, this inevitably leads to a biased sample (which is why Alexa traffic doesn't match measured traffic on the sites I run). Even with the best statistical techniques for reducing bias, you can never get rid of it entirely when the sampling distribution is not uniform.
Unclear how Google does it, although it might involve tracking cookies.
A project I have been working on recently has bearing on this question.
Another way to do this (that also has biases, but different ones) would be to use an IP to location service to find the approximate latitude and longitude of each visitor to your site. Then use my project (full disclosure: I run that site and it is commercial):
http://askgeo.com
To get demographic information for that location. AskGeo actually provides demographic information on several geographic levels (state, county, county subdivision, city, ZIP code, census tract (a few thousand people), and census block group (about a thousand people). You'd presumably want to use the lowest level (i.e., census block group) for a given latitude and longitude.
The site returns a huge number of demographic variables. The idea would be to use soft counts from the demographic variables provided on the block group level. To take an example, if you are trying to track the age distribution of your users, then you'd use the age ranges provided in the AskGeo response and for a given sample, you'd add a fractional soft count to each range that corresponds to the percentage of the population in that block group from the corresponding age range. For example, take my neighborhood in San Francisco. It has the following age distribution:
CensusAgePercent0To4: 7.3%
CensusAgePercent5To9: 3.5%
CensusAgePercent10To: 3.2%
... (skipping a bit, as you probably get the idea) ...
CensusAgePercentOver85: 1.5%
If you got an IP address that you tracked to that census block group, you'd add each of those percentages (as a fraction from 0 to 1) to your (soft) counters for those age ranges. (A soft counter is just a counter that allows for non-integer counts.)
You could do the same with race, gender, income level, house values, etc.
This method also has biases, for sure, since it assumes that all the people in a given block group are equally likely to visit your site. But it is something that you can do on your own site, not just Google and Alexa, and it would still give you a relative sense of who is visiting your site if your soft counts in a given category are higher than the national average in that category.
It is also possible that a more sophisticated technique than simple direct counts could lead to a much richer result.
I did some research, and apparently these demographics are tracked the same way TV audience demographics are tracked. There are people who browse with their (Alexa's) toolbars, which keeps track of the sites visited. These people willingly (?) supply information like age, gender, etc. and Alexa extrapolates the general demographics from this sample. This of course leaves room for bias, but that's a problem with statistics.
Alexa gets its information from browser toolbars that you install on purpose or as part of a bundle with some software.
It asks questions to understand demographic params and also tracks sites that you visit. If you know that 80% of site visitors are women and you have new visitor who visits this site that you can think that there is high probability that this person is a woman. If you know a lot of sites this person visits you can guess a lot.
But as http://netberry.co.uk/alexa-rank-explained.htm says you can rely only on information from Alexa TOP100,000 because then Alexa has enough information from small amount of users visiting these sites. They say "millions" but it's small share of total

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the of the charts (just because they're so bad).
If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What's your guys' thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per-se.
You seem to be missing the point that likes and dislikes in movies are anything but objective even within the context of a relatively homogeneous group of "voters". Think how the term "Chix Flix" or the success story called "NetFlix", illustrate this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem of dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie or because they neither truly like nor disliked it? Very likely a bit of both, therefore we can/should use the count of the "Page views without vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies will appear more notorious or popular)
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts is visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (eg "likes" / "total" or "likes"/"dislikes" etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereby the number of votes (and of views) is indicative of the notoriety ("name recognition" etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce much volatility in the rating. Phrased otherwise, small samples make for not so statically representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/view happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but also may be used to direct the users towards currently hot items. BTW, hence feeding the bandwagon effect mentioned :-( but also, increasing the voting sample size :-).
All these considerations suggest caution in implementing this rating system. It also hints at the likely need of including statistics about the complete set of movies into the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of the its own vote/view counts but also on say the average vote counts a move receives, the maximum view a movie page gets etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated by using the statistics of groups of movies similarly rated may provide a better system (provided the formulas are "fair" and somehow converge)
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes that gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal), and how many votes are needed to change the rating substantially.

Resources