Sample size for google content experiment - google-analytics

Can anybody give me an idea of what kind of traffic / sample size I need to get a statistically significant result when running a Google content experiment with two variations?

Google uses multi-armed bandit testing.
Here is a good article on this: Google's answer
The best way in practice is to watch the percentage in the Google analytics experiments tab and see how quickly it moves toward 95%.
You can't get an exact answer, because the required sample size changes as you take measurements and depends on the size of the difference you are trying to detect. So if one variation performs 300% better than the other, it will take a much smaller sample than if one variation performs only 10% better than the other.
To see how the math for straight-up statistical significance works, here is a good explanation: Statistical significance tutorial
Here is a spot that has a calculator: Calculator
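For a rough sense of the numbers, the classical fixed-horizon calculation behind such calculators can be reproduced in a few lines. This is a sketch only, not the bandit allocation Google actually runs, and the baseline rate and minimum detectable rate below are made-up examples:

# Sketch of a classical two-proportion sample size calculation.
# The conversion rates below are illustrative placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05        # hypothetical conversion rate of the original page
detectable_rate = 0.055     # smallest improved rate you want to be able to detect

effect_size = proportion_effectsize(baseline_rate, detectable_rate)
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # 5% false-positive rate (the usual "95% significance")
    power=0.8,    # 80% chance of detecting a difference of this size if it exists
    ratio=1.0,    # equal traffic split between the two variations
)
print(f"Visitors needed per variation: {n_per_variation:,.0f}")

Shrinking the gap between the two rates makes the required sample grow quickly, which is the point made above.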
As for the math behind the multi-armed bandit, this quote by Peter Whittle sums it up:
[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.

Related

Choosing a similarity metric for user-scores of television shows

I have a database of user ratings of various television shows on a 1-10 scale. I've been trying to find a good way of determining how similar two users' score lists are to one another over their shared shows.
It feels like the most obvious way to do this is just to take the absolute value of the difference and then sum/average that over all shared shows. But I've read that this does not take into account that users rate things on different scales. I saw some people saying cosine similarity is better for this sort of thing. Unfortunately, I've run into a lot of cases where that metric doesn't really make sense.
Example:
overall average of user1 = 8.1
overall average of user2 = 5.8
scores for shared shows only:
S1 = [8,8,10,10,10,10,6,8,10,5,6,10]
S2 = [5,6,7,8,9,9,4,5,9,1,2,8]
Obviously, these two people rated the shows they watched quite differently. When I use the average difference, it says they are not very similar (2.3, where 0 means identical). When I use something like cosine similarity, it says they are extremely similar (0.97, where 1 means identical).
Is there a metric that would be better suited for this kind of thing? My ultimate goal is to recommend users shows from other users that have similar tastes to them.
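One way to account for users rating on different scales is to subtract each user's own average before comparing, which turns cosine similarity into the Pearson correlation (often called adjusted or centered cosine in recommender work). A minimal sketch using the score lists from the question:

# Sketch: compare plain cosine similarity with a mean-centered version,
# using the two score lists from the question.
import numpy as np
from scipy.stats import pearsonr

s1 = np.array([8, 8, 10, 10, 10, 10, 6, 8, 10, 5, 6, 10], dtype=float)
s2 = np.array([5, 6, 7, 8, 9, 9, 4, 5, 9, 1, 2, 8], dtype=float)

# Plain cosine similarity (the metric that reported ~0.97).
cosine = s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2))

# Center each user's scores on their own average over the shared shows first.
c1, c2 = s1 - s1.mean(), s2 - s2.mean()
centered_cosine = c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2))

# Pearson correlation is identical to the centered cosine computed this way.
r, _ = pearsonr(s1, s2)

print(f"cosine={cosine:.2f}  centered cosine={centered_cosine:.2f}  pearson r={r:.2f}")

For this particular pair the centered measure still comes out high, which arguably reflects the stated goal: both users rank the same shows relatively high and low, just on different personal scales.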

Displaying an element only for part of the users

I want to show elements to only x% of the visitors to my page.
Considering low traffic - is a simple random draw against x% enough in your opinion? Or does that only work for high traffic?
Thanks
This sounds like A/B testing or multivariate testing, which you can do with Google Website Optimizer or an ASP.NET-specific library like fairlycertain.
Just using a random number wouldn't be a good solution in either case; with low or high traffic, the distribution of assignments could very well turn out uneven.
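A quick simulation makes the point concrete: with independent random draws and low traffic, the share of visitors who actually see the element can land far from the target percentage. Numbers below are illustrative:

# Sketch: how far the observed share can drift from the target percentage
# when visitors are assigned by independent random draws.
import random

def simulate(n_visitors, target_pct, trials=2_000, seed=42):
    rng = random.Random(seed)
    shares = []
    for _ in range(trials):
        shown = sum(rng.random() < target_pct for _ in range(n_visitors))
        shares.append(shown / n_visitors)
    return min(shares), max(shares)

for n in (20, 200, 2000):
    lo, hi = simulate(n, target_pct=0.10)
    print(f"{n:5d} visitors: observed share ranged from {lo:.1%} to {hi:.1%}")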

How to mitigate against bandwagon effect (voting behavior) in my ranking system?

What I mean by the bandwagon effect is the following:
Already top-ranked items have a higher tendency to get voted on at all, possibly even to get upvoted.
What I am hoping for are some concrete recommendations, ideally based on your practical experience with a mathematical formula and the situation in which it helped.
However, any useful pointers are more than welcome!
My ranking system
Please consider a ranking system at a website that has a reputation system and where users cast only upvotes on items and the ranking table is reset to start fresh every month.
Every user has one upvote per item within each month, and there is a reward for users who, within a certain month, upvoted an item that made it into the top ranks at the end of that month.
Users are told the following about what increases the weight of their upvote:
1)... the more reputation you have at the time of upvoting
2)... the fewer items you upvote within the current month (including the current upvote)
3)... the fewer upvotes that item already has within the current month before your own upvote
The ranking table is recalculated once a day and is visible to all.
Goal
I'd like to implement part 3) in an effort to correct the ranks of items where one cannot tell whether some users upvoted them only because of the bandwagon effect (those users might hope to gain a "tactical" advantage simply by voting for what they perceive lots of other users have already upvoted).
I also hope this mitigates the possible use of sock puppets that have managed to attain some reputation but all upvote the same item or group of items.
Question
Is there a (maybe even tested?) mathematical formula that I could apply to the time-ordered list of upvotes for each item to get a coefficient for each of those upvotes, so that their weights are corrected in a sensible fashion?
I'm thinking it has to be something like a logarithmic function, but I can't quite get a grip on it...
Thank you!
Edit
Zack says: "beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed"
To further clarify: what I am after is which actual mathematical approaches are worth trying out that will, in the form of a mathematical function, translate this decrease in popularity (i.e., apply coefficients to the weights, see above) in a sensible, balanced manner.
My hope is that someone has practical experience with such approaches in a situation similar or comparable to the one above.
Consider applying the "Indie Rock Peter Principle": beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed.
Term coined by Leonard Richardson in this paper. Indie Rock Peter is of course from Diesel Sweeties.
I have always disliked the bandwagon effect in voting systems, especially "most viewed" rankings in which simply clicking on a highly ranked item increases its rank. My solution to this problem, which I have never tested or seen implemented, would be to keep track of how an item was reached (and then voted for), and ignore (or greatly decrease the weight of) votes that came from any sorted-by-ranking page.
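As a starting point rather than a tested formula, the logarithmic damping the question hints at could look something like the sketch below: the k-th upvote an item receives within the month gets a shrinking coefficient, so early votes count more than late, bandwagon ones. The particular function and its constants are arbitrary tuning knobs:

# Sketch of a logarithmic damping coefficient for later upvotes.
import math

def upvote_coefficient(k: int) -> float:
    """Coefficient for the k-th upvote an item receives this month (k starts at 1)."""
    return 1.0 / math.log2(k + 1)   # 1.0, 0.63, 0.5, 0.43, ... for k = 1, 2, 3, 4

def weighted_score(upvote_weights: list[float]) -> float:
    """Apply the damping coefficient to a time-ordered list of upvote weights
    (the weights already reflect reputation and votes cast, per rules 1 and 2)."""
    return sum(w * upvote_coefficient(k) for k, w in enumerate(upvote_weights, start=1))

print(weighted_score([1.0, 1.0, 1.0, 1.0, 1.0]))  # five equal votes score well under 5.0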

How do you estimate a ROI for clearing technical debt?

I'm currently working with a fairly old product that's been saddled with a lot of technical debt from poor programmers and poor development practices in the past. We are starting to get better and the creation of technical debt has slowed considerably.
I've identified the areas of the application that are in bad shape and I can estimate the cost of fixing those areas, but I'm having a hard time estimating the return on investment (ROI).
The code will be easier to maintain and will be easier to extend in the future but how can I go about putting a dollar figure on these?
A good place to start seems to be going back through our bug tracking system and estimating costs based on the bugs and features related to these "bad" areas. But that seems time-consuming and may not be the best predictor of value.
Has anyone performed such an analysis in the past and have any advice for me?
Managers care about making $ through growth (first and foremost, e.g. new features that attract new customers) and, second, through optimizing the process lifecycle.
Looking at your problem, your proposal falls into the second category: it will undoubtedly fall behind goal #1 (and thus get prioritized down even if it could save money... because saving money implies spending money, most of the time at least ;-)).
Now, putting a $ figure on the "bad technical debt" could be given a more positive spin (assuming the following applies in your case): "if we invest in reworking component X, we could introduce feature Y faster and thus get Z more customers".
In other words, evaluate the cost of technical debt against cost of lost business opportunities.
Sonar has a great plugin (the Technical Debt plugin) that analyzes your source code for just such a metric. While you may not be able to use it in your build specifically, as it is a Maven tool, it should provide some good metrics.
Here is a snippet of their algorithm:
Debt (in man-days) =
    cost_to_fix_duplications
  + cost_to_fix_violations
  + cost_to_comment_public_API
  + cost_to_fix_uncovered_complexity
  + cost_to_bring_complexity_below_threshold

where:

Duplications = cost_to_fix_one_block * duplicated_blocks
Violations   = cost_to_fix_one_violation * mandatory_violations
Comments     = cost_to_comment_one_API * public_undocumented_api
Coverage     = cost_to_cover_one_of_complexity * uncovered_complexity_by_tests
               (80% coverage is the objective)
Complexity   = cost_to_split_a_method * (function_complexity_distribution >= 8)
             + cost_to_split_a_class  * (class_complexity_distribution >= 60)
I think you're on the right track.
I've not had to calculate this but I've had a few discussions with a friend who manages a large software development organisation with a lot of legacy code.
One of the things we've discussed is generating some rough effort metrics from analysing VCS commits and using them to divide up a rough estimate of programmer hours. This was inspired by Joel Spolsky's Evidence-based Scheduling.
Doing such data mining would allow you to also identify clustering of when code is being maintained and compare that to bug completion in the tracking system (unless you are already blessed with a tight integration between the two and accurate records).
Proper ROI needs to calculate the full Return, so some things to consider are:
- decreased cost of maintenance (obviously)
- opportunity cost to the business of downtime or missed new features that couldn't be added in time for a release
- ability to generate new product lines due to refactorings
Remember, once you have a rule for deriving data, you can have arguments about exactly how to calculate things, but at least you have some figures to seed discussion!
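As a hedged illustration of the VCS-mining idea above, commit history can give a rough picture of how much ongoing effort the "bad" areas absorb. The directories below are hypothetical placeholders, and the script assumes a Git repository with git available on the PATH:

# Sketch: count commit file-changes touching known problem areas over the last year,
# as a rough proxy for where maintenance effort is going.
import subprocess
from collections import Counter

BAD_AREAS = ("src/legacy/", "src/billing/")   # hypothetical problem directories
SINCE = "12 months ago"

log = subprocess.run(
    ["git", "log", f"--since={SINCE}", "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

touches = Counter()
for path in filter(None, log.splitlines()):
    for area in BAD_AREAS:
        if path.startswith(area):
            touches[area] += 1

for area, count in touches.most_common():
    print(f"{area}: touched in {count} commit file-changes over the last year")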
I can only speak to how to do this empirically in an iterative and incremental process.
You need to gather metrics to estimate your demonstrated best cost/story-point. Presumably, this represents your system just after the initial architectural churn, when most of design trial-and-error has been done but entropy has had the least time to cause decay. Find the point in the project history when velocity/team-size is the highest. Use this as your cost/point baseline (zero-debt).
Over time, as technical debt accumulates, the velocity/team-size begins to decrease. The percentage decrease of this number with respect to your baseline can be translated into "interest" being paid on each new story point. (This is really interest paid on technical and knowledge debt)
Disciplined refactoring and annealing cause the interest on technical debt to stabilize at some value higher than your baseline. Think of this as the steady-state interest the product owner pays on the technical debt in the system. (The same concept applies to knowledge debt.)
Some systems reach the point where the cost + interest on each new story point exceeds the value of the feature point being developed. This is when the system is bankrupt, and it's time to rewrite the system from scratch.
I think it's possible to use regression analysis to tease apart technical debt and knowledge debt (but I haven't tried it). For example, if you assume that technical debt correlates closely with some code metrics, e.g. code duplication, you could determine the degree the interest being paid is increasing because of technical debt versus knowledge debt.
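A small sketch of the "interest" calculation described above, under the stated assumption that cost per story point is proportional to team size over velocity; all numbers are invented for illustration:

# Sketch: technical-debt "interest" as the relative increase in cost per story point.
def cost_per_point(team_size: float, velocity: float) -> float:
    return team_size / velocity          # person-sprints per story point

baseline = cost_per_point(team_size=6, velocity=48)   # best observed (near zero debt)
current  = cost_per_point(team_size=6, velocity=30)   # today

interest_rate = (current - baseline) / baseline
print(f"Each new story point costs {interest_rate:.0%} more than at the baseline")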
+1 for jldupont's focus on lost business opportunities.
I suggest thinking about those opportunities as perceived by management. What do they think affects revenue growth -- new features, time to market, product quality? Relating debt paydown to those drivers will help management understand the gains.
Focusing on management perceptions will also help you avoid false precision. ROI is an estimate, and it is no better than the assumptions made in estimating it. Management will be suspicious of purely quantitative arguments because they know there is something qualitative in there somewhere. For example, over the short term the real cost of your debt paydown is the other work the programmers aren't doing, rather than the cash cost of those programmers, because I doubt you're going to hire and train new staff just for this. Are the improvements in future development time or quality more important than the features these programmers would otherwise be adding?
Also, make sure you understand the horizon for which the product is managed. If management isn't thinking about two years from now, they won't care about benefits that won't appear for 18 months.
Finally, reflect on the fact that management perceptions have allowed this product to get to this state in the first place. What has changed that would make the company more attentive to technical debt? If the difference is you -- you're a better manager than your predecessors -- bear in mind that your management team isn't used to thinking about this stuff. You have to find their appetite for it, and focus on those items that will deliver results they care about. If you do that, you'll gain credibility, which you can use to get them thinking about further changes. But appreciation of the gains might be a while in growing.
Being mostly a lone or small-team developer, this is out of my field, but to me a great way to find out where time is wasted is very, very detailed timekeeping, for example with a handy task-bar tool like this one, which can even filter out when you go to the loo and can export everything to XML.
It may be cumbersome at first, and a challenge to introduce to a team, but if your team can log every fifteen minutes they spend on a bug, mistake or misconception in the software, you accumulate an impressive base of real-life data on what technical debt is actually costing in wages every month.
The tool I linked to is my favourite because it is dead simple (it doesn't even require a database) and provides access to every project/item through a task-bar icon. Additional information on the work carried out can be entered there as well, and timekeeping is literally activated in seconds. (I am not affiliated with the vendor.)
It might be easier to estimate the amount it has cost you in the past. Once you've done that, you should be able to come up with an estimate for the future with ranges and logic even your bosses can understand.
That being said, I don't have a lot of experience with this kind of thing, simply because I've never yet seen a manager willing to go this far in fixing up code. It has always just been something we fix up when we have to modify bad code, so refactoring is effectively a hidden cost on all modifications and bug fixes.

Algorithm for similarity (of topic) of news items

I want to determine the similarity of the content of two news items, similar to Google News, but different in the sense that I want to be able to determine what the basic topics are and then which topics are related.
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
If you can just throw around keywords like k-nearest neighbours, with a little explanation of why they work (if you can), I will do the rest of the research and tweak the algorithm. I'm just looking for a place to get started, since I know someone out there must have tried something similar before.
First thoughts:
toss away noise words (and, you, is, the, some, ...).
count all other words and sort by quantity.
for each word in the two articles, add a score depending on the sum (or product or some other formula) of the quantities.
the score represents the similarity.
It seems to me that an article primarily about Donald Rumsfeld would contain those two words quite a bit, which is why I weight them heavily for that article.
However, there may be an article mentioning Warren Buffet many times with Bill Gates once, and another mentioning both Bill Gates and Microsoft many times. The correlation there would be minimal.
Based on your comment:
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).
That's where I'd start and I can see potential holes in the theory already (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Hillary for the other).
I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).
One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.
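A rough sketch of that word-count approach: strip the noise words, count the rest, and score a pair of articles by summing counts for the words they share. The stop-word list is deliberately tiny and illustrative, and the sum is just one of the scoring formulas suggested above:

# Sketch: keyword-overlap similarity between two article texts.
from collections import Counter
import re

STOP_WORDS = {"and", "you", "is", "the", "some", "a", "of", "in", "to"}

def word_counts(text: str) -> Counter:
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def similarity(text_a: str, text_b: str) -> int:
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[w] + counts_b[w] for w in shared)

a = "Donald Rumsfeld discussed Iraq policy with Donald Rumsfeld's advisers."
b = "Business dealings in Iraq drew questions for Rumsfeld."
print(similarity(a, b))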
At the moment I am thinking of something like this.
Each non-noise word is a dimension. Each article is represented by a vector, where words that don't appear are zero and words that do appear get a value equal to the number of times they appear divided by the total number of words on the page. Then I can take the Euclidean distance between any two points in this space to get the similarity of two articles.
The next step would be to determine clusters of the articles and then a central point for each cluster. Then compute the Euclidean distance between any two cluster centres, which gives the similarity of the topics.
Baaah, I think by typing it out I solved my own problem. Of course, only at a very high level; I am sure when I get down to it I will find problems... the devil is always in the detail.
But comments and improvements still highly appreciated.
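For what it's worth, the vector-space approach described above maps almost directly onto off-the-shelf tooling. The sketch below uses scikit-learn with tf-idf weights in place of the plain term frequencies described, and the article texts are placeholders:

# Sketch: term vectors, pairwise distances, and topic clusters with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import KMeans

articles = [
    "Saddam Hussein and the situation in Iraq ...",
    "Donald Rumsfeld's business dealings in Iraq ...",
    "Warren Buffett and Bill Gates discuss philanthropy ...",
]

# Each article becomes a vector over non-noise words (stop words removed).
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Pairwise Euclidean distance between article vectors (smaller = more similar).
print(euclidean_distances(vectors))

# Cluster the articles; the cluster centres play the role of the
# "central point" for each topic mentioned above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)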
