Contradictory Google content experiment results - google-analytics

We use Google Experiments heavily through the Experiments API and recently started seeing experiments end with the winner having a lower score than other variations -- see the attached screenshots of two recent experiments.
Has anyone else seen this? Any idea what it means?

Probably because the confidence is really low, so Google is not confident enough to apply any variation.
Choosing the winner of your experiment
Over time, you should see either the original ad setting or the
variation outperform the other. To help you decide the right time to
choose your winner, we calculate a confidence score for the two
settings. This score indicates how likely the original or the
variation is to be the better performing (e.g., the highest earning).
We recommend that you wait until one of the scores reaches at least
95% before you choose that setting as the winner. When a setting has a
95% confidence score it means that we are 95% certain that that
setting is the better performing of the two.
You can read more:
I also suggest this page:
When you're choosing the winner of your experiment, we encourage you
to consider the Overall Change as well as the revenue (RPM) and the
relative Quality Score. The Overall Change equals the change in RPM
plus the change in Quality Score of the variation. In general, a
positive Overall Change will result in better long-term performance,
even when either the RPM change or the Quality Score change is
negative. In situations where the RPM and the Quality Score point to
different winners, a positive Overall Change can help you to make a
final decision.


Percentage value on firebase web performance distribution graphs

I'm reading my reports from Firebase web performance and I don't understand what the percentage value in the distribution graph means.
Below is a screenshot of loadEventEnd with the graph.
Does the 95% value represent the share of users affected by the load time of 11.85s?
What you're seeing there is the 95th Percentile number. That means that if you ordered all events from fastest to slowest, five percent of all requests took 11.85s or longer.
Percentile metrics are very useful for measuring performance problems in the margins -- your median load time may be great, but if 5% of users are experiencing a very long load time, it might be worth trying to dig into why and optimizing it more.
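To make the definition concrete, here is a minimal sketch in Python using the nearest-rank method. The load times are made up for illustration, and Firebase may well compute its percentiles differently (for example with interpolation), so treat this as an approximation of the idea rather than of the product.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the value at 1-based rank ceil(p% of n)."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical load times in seconds: most requests are fast, a few are very slow.
load_times = [1.2] * 80 + [3.0] * 14 + [11.85] * 2 + [15.0] * 2 + [20.0] * 2

median = statistics.median(load_times)
p95 = percentile(load_times, 95)
print(f"median = {median}s, 95th percentile = {p95}s")
```

With this sample the median looks healthy while the 95th percentile exposes the slow tail, which is the situation described above.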
A great exercise might be to use the filters that are available to try to find a cohort of users for whom the median number is closer to that 95p number -- for instance, a specific browser or geographical region. That will give you more insight into who exactly is having a slower experience on your site.

Choosing a similarity metric for user-scores of television shows

I have a database of user ratings of various television shows on a 1-10 scale. I've been trying to find a good way of determining how similar two users' score lists are to one another for the shows they share.
The most obvious way seems to be to take the absolute value of the difference for each shared show and then sum or average those values. But I read that this does not take into account that users rate things on different scales. I saw some people saying cosine similarity is better for this sort of thing. Unfortunately, I've run into a lot of cases where that metric doesn't really make sense.
overall average of user1 = 8.1
overall average of user2 = 5.8
scores for shared shows only:
S1 = [8,8,10,10,10,10,6,8,10,5,6,10]
S2 = [5,6,7,8,9,9,4,5,9,1,2,8]
Obviously, these two people rated the shows they watched pretty differently. When I use the average difference it says they are not very similar (2.3 where 0 is the same). When I use something like the cosine similarity it says they are extremely similar (0.97 where 1 is the same).
Is there a metric that would be better suited for this kind of thing? My ultimate goal is to recommend users shows from other users that have similar tastes to them.
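For reference, the two numbers quoted above can be reproduced with a short Python sketch. The last function, a cosine similarity computed after subtracting each user's overall average, is an assumption added here as one common way to account for different rating scales; it is not something this thread settles on.

```python
from math import sqrt

S1 = [8, 8, 10, 10, 10, 10, 6, 8, 10, 5, 6, 10]
S2 = [5, 6, 7, 8, 9, 9, 4, 5, 9, 1, 2, 8]

def mean_abs_diff(a, b):
    """Average absolute difference over shared shows (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    """Cosine similarity of the raw score vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def centered_cosine(a, b, mean_a, mean_b):
    """Cosine similarity after removing each user's own average (assumed variant)."""
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

print(round(mean_abs_diff(S1, S2), 2))  # about 2.33
print(round(cosine(S1, S2), 3))         # about 0.977
# Overall averages 8.1 and 5.8 are taken from the question.
print(round(centered_cosine(S1, S2, 8.1, 5.8), 3))
```

Raw cosine is high mostly because all scores are positive, so the vectors point in nearly the same direction regardless of taste; centering removes that shared offset before comparing.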

Single-stat percentage change from initial value in graphite/grafana?

Is there a way to simply show the change of a value over the selected time period? All I'm interested in is the offset of the last value compared to the initial one. The values can vary above and below these over the time period, but that's not really relevant (and would be an exception in my case).
For an initial value of 100 and a final value of 105, I'd expect a single stat box displaying 5%.
I have the feeling I'm missing something obvious, but I can't find a way to accomplish this deceptively simple task.
I'm trying to create a scripted Grafana dashboard that will automatically populate disk consumption growth for all our various volumes. The data is already in Graphite, but for purposes of capacity management and finance planning (which projects/departments get billed) it would be helpful for managers to have a simple and coarse overview of which volumes grow outside expected parameters.
The idea was to create a list of single-stat values with color coding that could easily be scrolled through to find abnormalities. Disk usage would obviously never be negative, but volatility in usage between the start and end of the time period would be lost in this view. That's not a big concern for us as this is all shared storage and such usage is expected to a certain degree.
The perfect solution would be to have the calculations change dynamically based on the selected time period.
I'm thinking that this is not really possible (at least not easily) to do with just Graphite and Grafana and have started looking for alternative methods. We might have to implement a different reporting system for this purpose.
Edit 2
I've tried implementing the suggested solution from Leonid, and it works after a fashion. The calculations seem somewhat off from what I expected, though.
My test dashboard looks as follows:
If I were to calculate the change manually, I'd end up with roughly 24% change between the start (7.23) and end (8.96) values. Graphite calculates this as 19%. There's probably a reason for the discrepancy, possibly something to do with it being a time series and not discrete values?
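The manual calculation can be written out as a small Python sketch. The last line is only an observation, not a confirmed diagnosis: dividing the same difference by the final value instead of the initial one gives a figure close to the 19% reported by Graphite, so it may be worth ruling out which series ends up as the denominator.

```python
def percent_change(initial, final):
    """Change from initial to final, as a percentage of the initial value."""
    return (final - initial) / initial * 100

print(percent_change(100, 105))              # 5.0
print(round(percent_change(7.23, 8.96), 1))  # 23.9

# Same difference expressed relative to the final value instead (speculative check).
print(round((8.96 - 7.23) / 8.96 * 100, 1))  # 19.3
```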
As a sidenote: The example is only 30 days, even though the most interesting number would be a year. We don't have quite a year of data in Graphite yet and having a 30 day view is also interesting. It seems I have to implement several dashboards with static times.
You certainly can do that for some fixed period. For example, the following query takes the absolute difference between the current metric value and the value the metric had one minute ago (i.e. the initial value), and then calculates it as a percentage of the initial value.
asPercent(absolute(diffSeries(my_metric, timeShift(my_metric, '1m'))), timeShift(my_metric, '1m'))
I believe you can't do that for the time period selected in the Grafana time picker.
But is that really what you need? It seems strange because, as you said, the value can change in both directions. Maybe standard deviation would be more suitable for you? It's available in Graphite as the stdev function.

Sample size for google content experiment

Can anybody give me an idea of what kind of traffic / sample size I need to get a statistically significant result when running a Google content experiment with 2 variations?
Google uses Multi Armed Bandit testing.
Here is a good article on this: Google's answer
The best way in practice is to watch the percentage in the Google Analytics experiments tab and see how quickly it moves toward 95%.
You can't get an exact answer because it changes as you take measurements and depends on the size of the difference you are trying to measure. So if one variation performs 300% better than the other, it will take a much smaller sample size than if one variation performs only 10% better than the other.
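To illustrate how the required sample size depends on the size of the difference, here is a rough Python sketch using the classical normal-approximation formula for comparing two conversion rates. This is a fixed-horizon estimate, not the bandit procedure Google uses, and the 2% baseline conversion rate, the 5% significance level and the 80% power are all assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(p_base, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation for a two-sided two-proportion test."""
    p_var = p_base * (1 + relative_lift)           # conversion rate of the variation
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2)

# Assumed baseline conversion rate of 2%.
big_effect = sample_size_per_variation(0.02, 3.0)     # variation is 300% better
small_effect = sample_size_per_variation(0.02, 0.10)  # variation is 10% better
print(big_effect, small_effect)
```

The required sample grows roughly with the inverse square of the difference, which is why a small lift needs far more traffic than a large one.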
To see how the math for straight-up statistical significance works, here is a good explanation: Statistical significance tutorial
Here is a calculator: Calculator
As for the math behind the Multi Armed Bandit, this quote by Peter Whittle sums it up:
[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.

Rating system that considers time and activity

I'm looking for a rating system that weights the rating not only on the number of votes, but also on time and "activity".
To clarify a bit:
Consider a site where users produce something, like a picture.
There is another type of user that can vote on other people's pictures (on a scale of 1-5), but each picture will only receive one vote.
The rating a productive user gets is derived from the ratings his or her pictures have received, but should be affected by:
How long ago the picture was made
How productive the user has been
A user who's getting 3's and 4's and still making 10 pictures per week should get a higher rating than a person who has gotten 5's but only made 1 picture per week and stopped a few months ago.
I've been looking at the Bayesian estimate, but that only considers the total number of votes, independent of time or productivity.
My math fu is pretty strong, so all I need is a nudge in the right direction and I can probably modify something to fit my needs.
There are many things you could do here.
The obvious approach is to have your measure of the scores decay with time in your internal calculations, for example using an exponential decay with a time constant T. That is, use value = initial_score*exp(-t/T), where t is the time that has passed since the picture was submitted. So if T is one month, after one month this score will contribute 1/e, or about 0.37, of what it originally did. (You can also do this differentially, btw, with value -= (dt/T)*value, if that's more convenient.)
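A minimal Python sketch of this decay, with time measured in days and T set to 30 days as an assumption. Summing the decayed scores of a user's pictures to get the user rating is also an assumption made here for illustration (it rewards recent, frequent output); the function names and sample data are hypothetical.

```python
from math import exp

def decayed_score(initial_score, age, time_constant):
    """value = initial_score * exp(-t / T)"""
    return initial_score * exp(-age / time_constant)

def decay_step(value, dt, time_constant):
    """Differential form: value -= (dt / T) * value, applied for one small step dt."""
    return value - (dt / time_constant) * value

def user_rating(pictures, time_constant=30.0):
    """Assumed aggregation: sum of decayed scores over (score, age_in_days) pairs."""
    return sum(decayed_score(score, age, time_constant) for score, age in pictures)

# Hypothetical users: one active with 3's and 4's, one with 5's who stopped months ago.
active = [(3, 1), (4, 2), (3, 5), (4, 6)]
inactive = [(5, 90), (5, 97)]
print(user_rating(active) > user_rating(inactive))

# After one time constant, a score contributes 1/e of its original value.
print(round(decayed_score(1.0, 30.0, 30.0), 2))  # 0.37
```

The differential form approximates the same curve when applied repeatedly with a step dt much smaller than T.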
There's probably a way to work this with a Bayesian approach, but it seems forced to me. Bayesian approaches are generally about predicting something new based on a (usually large) set of prior data, which doesn't directly match your model.
