Google Optimize: "Calculated Conversion Rate" contradicting "Probability to be Best" - google-analytics

I'm seeing some strange things on a currently running experiment.
The Observed Conversion Rate is noticeably higher on the Original,
yet Optimize's analysis says the variant is more likely to be best.
What should I trust here?
The Modeled Conversion Rate also seems way off, with median CRs of
1.2% (original) and 1.3% (variant).
Why would the model overshoot reality so much?
Thanks for your guidance!

Related

What should the default power be in power.prop.test?

Is there an industry standard for the power argument of power.prop.test?
I am using the function to find p2, but I'm not sure what the standard is for power.
power.prop.test(
  n         = 6289195,
  p1        = 0.004,
  power     = 0.8,
  sig.level = 0.05,
  tol       = .Machine$double.eps^0.8
)
For example, should it be 0.8 or 0.9?
This is a practical statistics question rather than an R question, but a power of 0.8, i.e. 80%, is common. Since it is common (rather like 95% confidence), people think they understand what it is saying and do not question its choice as much as they might other values.
You need to remember that it is an arbitrary target: if you changed it in your example, the main impact would be to give you a different result for p2. Really you should be explicitly balancing the cost of increasing the sample size against the different costs of Type I and, in particular, Type II errors.
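For instance, a minimal sketch re-using the call from the question shows how the p2 it solves for shifts with the power target:

# Same call as in the question, solving for p2 at two power targets
p2_at <- function(pwr) {
  power.prop.test(n = 6289195, p1 = 0.004, power = pwr,
                  sig.level = 0.05, tol = .Machine$double.eps^0.8)$p2
}
c(power_0.8 = p2_at(0.8), power_0.9 = p2_at(0.9))
# The higher power target returns a p2 further from p1: with the same n
# and sig.level, you can only reliably detect a somewhat larger difference.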
A common reference is Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic, section 2.4, which says:
It is proposed here as a convention that, when the investigator has no other basis for setting the desired power value, the value .80 be used. This means that β is set at .20. This arbitrary but reasonable value is offered for several reasons (Cohen, 1965, pp. 98-99). The chief among them takes into consideration the implicit convention for α of .05. The β of .20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc.
Other examples of 0.8 found in a quick search:
The R stats reference page for power.prop.test uses power=0.8 as an example
A University of Ottawa medicine page says "A power of 80% is often chosen; hence a true difference will be missed 20% of the time. This is a compromise because raising power to 90% power will require increasing the sample size by about 30%"
The Statistics Done Wrong site and book says "A scientist might want to know how many patients are needed to test if a new medication improves survival by more than 10%, and a quick calculation of statistical power would provide the answer. Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of concluding there’s a real effect. However, few scientists ever perform this calculation, and few journal articles ever mention the statistical power of their tests."

How do I solve this in R, maybe with power.t.test?

A graduate program is interested in estimating the average annual income of its alumni. Although no prior information is available to estimate the population variance, it is known that most alumni incomes lie within a $10,000 range. The graduate program has N = 1,000 alumni. Determine the sample size necessary to estimate with a bound on the error of estimation of $500.
I know how to deal with it statistically, but I don't know if I have to use R.
power.t.test requires 4 arguments: delta, sig.level, sd, power (since n is what I want).
I know that sd can be approximated from the $10,000 range as 10000/4 = 2500,
but how do I deal with the other three?
Addition:
I googled how to do this statistically (mathematically).
The method is from the book Elementary Survey Sampling by R. L. Scheaffer and W. Mendenhall, page 88. Stack Overflow doesn't allow me to add a picture yet, so I'll just share the link here:
https://books.google.co.jp/books?id=nUYJAAAAQBAJ&pg=PA89&lpg=PA89&dq=Although+no+prior+information+is+available+to+estimate+the+population+variance&source=bl&ots=Kqt7Cc5FFv&sig=Vx2bBRyi2KfrgMGkaC0f1EnfTWM&hl=en&sa=X&redir_esc=y#v=onepage&q&f=false
With the formula provided there, I can calculate that the required sample size is 91. Can anyone show me how to do this in R, please?
Thanks in advance!
P.S. Sorry about my poor English and formatting... I'm not familiar with this website yet.
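As a sketch (not part of the original post): the textbook's finite-population formula, with the sd approximated as range/4 and D = B^2/4, can be computed directly in R without power.t.test, and it reproduces the hand-calculated answer of 91.

# Finite-population sample size for estimating a mean (Scheaffer & Mendenhall):
#   n = N * sigma^2 / ((N - 1) * D + sigma^2),  where D = B^2 / 4
N     <- 1000          # number of alumni
sigma <- 10000 / 4     # rough sd from the $10,000 range
B     <- 500           # bound on the error of estimation
D     <- B^2 / 4
n     <- N * sigma^2 / ((N - 1) * D + sigma^2)
ceiling(n)             # 91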

Gatling 2: Arithmetic average metric in report

Currently the Gatling report provides the following metrics
minimum
maximum
50th percentile
75th percentile
95th percentile
99th percentile
mean
standard deviation
but it does not provide the arithmetic average.
I have found a related discussion about this in the google group, but am not sure about the outcome.
In general I agree that the average is not a good metric to be used. Percentiles are way more representative - no doubt about that. So from an engineering point of view I don't see a problem with the currently supported metrics.
But a problem arises from a legal point of view (as usual :). It seems there are ill-conceived SLAs out there which explicitly define an average request duration. Since Gatling does not emit an "average" metric, boneheaded customers won't accept the "Ready-to-present HTML reports" as proof that such SLA terms are met...
So my question is: Is there a way to add the average metric to the report output?
Mean and average are exactly the same thing: the "mean" line in the report is the arithmetic average, so the report already contains the metric the SLA asks for.

In Google Analytics Experiments why does "Probability of Outperforming Original" appear to contradict conversion rate?

Variation 1 and Variation 5 (see below) both have lower conversion rates than the original yet they are both labeled as more likely than not to outperform the original.
Am I seeing an error? If not, could someone shed some light on how this Probability of Outperforming Original value is calculated? Thanks.
Variation      Experiment Sessions   Conversions   Conversion Rate   Compare to Original   Probability of Outperforming Original
Original                     2,071         1,055            50.94%                    0%                                    0.0%
Variation 2                  1,028           541            52.63%                    3%                                   69.2%
Variation 4                  1,786           914            51.18%                    0%                                   61.7%
Variation 1                    523           258            49.33%                   -3%                                   58.0%
Variation 5                    837           423            50.54%                   -1%                                   53.2%
Variation 3                    517           242            46.81%                   -8%                                   44.0%
There is no basic or easy calculation that can be written out as a simple equation here. The calculation behind Google Experiments is based on the multi-armed bandit problem.
This is a concept that describes any situation in which you want to conduct an experiment in such a way that you maximize your reward.
Full description is available on Google Documentation - here:
https://support.google.com/analytics/answer/2844870?hl=en
Experiments based on multi-armed bandits are typically much more efficient than "classical" A-B experiments based on statistical-hypothesis testing. They’re just as statistically valid, and in many circumstances they can produce answers far more quickly.
They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a "final answer" at the end of an experiment.
They’re faster because samples that would have gone to obviously inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help separate the "good" arms from the "best" ones more quickly.
Calculation example is here: https://support.google.com/analytics/answer/2846882
I hope this helps you understand a bit more about how Google calculates the winner.
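As a rough illustration (my own sketch, not from the linked documentation): if the reported figure is roughly the Bayesian probability that a variation's true conversion rate exceeds the original's, a simplified version can be approximated by sampling from Beta posteriors. It will not reproduce Google's exact numbers, because their model also accounts for how the bandit allocates traffic over time, but it shows how a probability of this kind is computed:

# Simplified sketch: independent Beta(1 + conversions, 1 + non-conversions)
# posteriors for the Original and Variation 4, and a Monte Carlo estimate of
# P(variation rate > original rate).
set.seed(1)
draws <- 1e5
orig <- rbeta(draws, 1 + 1055, 1 + 2071 - 1055)  # Original: 1,055 / 2,071
var4 <- rbeta(draws, 1 +  914, 1 + 1786 -  914)  # Variation 4: 914 / 1,786
mean(var4 > orig)  # about 0.56 under this simplified model (GA reports 61.7%)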

How should I analyze web traffic in a statistically correct way?

I have a file with a sequence of event timestamps corresponding to the times at which someone visits a website:
02.02.2010 09:00:00
02.02.2010 09:00:00
02.02.2010 09:00:00
02.02.2010 09:00:01
02.02.2010 09:00:03
02.02.2010 09:00:05
02.02.2010 09:00:06
02.02.2010 09:00:06
02.02.2010 09:00:09
02.02.2010 09:00:11
02.02.2010 09:00:11
02.02.2010 09:00:11
etc, for several thousand rows.
I'd like to get an idea of how the web hits are distributed over time, over the week, etc. I need to know how I should scale the (future) web servers in order to guarantee service availability with a given number of nines. In particular, I need to give upper bounds on the number of almost-concurrent visits.
Are there any resources out there that explain how to do that? I'm fluent in mathematics and statistics, and I've looked at queueing theory, but that theory seems to assume the arrival rate is independent of the time of day, which is clearly wrong in my case. And NO, histograms are not the right answer, since the result depends heavily on bin width and placement.
You can always place a more flexible model on the arrival-rate parameter. For instance, make the arrival rate a function of time, or place some time-series-style model on it. Whatever makes sense for your data. The literature typically focuses on the core model because extensions are application specific.
In an extended model, you'll almost certainly want to use Bayesian methods. You are interested in the posterior predictive distribution of the object "almost concurrent events." A recent paper in JASA describes nearly your exact problem, applied to call center data:
Bayesian Forecasting of an Inhomogeneous Poisson Process With Applications to Call Center Data
For a quick solution, don't underestimate the power of histogram-style estimators. They are simple nonparametric estimators, and you can cross-validate tuning parameters like bin width and placement. Theoretically this is somewhat unsatisfying, but it would take a day to implement. A fully Bayesian approach will likely dominate, but at significant computational cost.
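For example (a sketch of my own, assuming the timestamps sit in a file "hits.txt" in the format shown in the question), one standard leave-one-out cross-validation risk for a histogram with binwidth h can be minimised over a grid of candidate binwidths:

# Cross-validated histogram binwidth for the hit times (seconds since start)
ts <- as.POSIXct(readLines("hits.txt"), format = "%d.%m.%Y %H:%M:%S")
x  <- as.numeric(ts) - min(as.numeric(ts))

cv_risk <- function(x, h) {
  n     <- length(x)
  edges <- seq(min(x), max(x) + h, by = h)
  p_hat <- as.vector(table(cut(x, edges, include.lowest = TRUE))) / n
  2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum(p_hat^2)
}

hs <- c(1, 5, 10, 30, 60, 300, 600)            # candidate binwidths, in seconds
hs[which.min(sapply(hs, function(h) cv_risk(x, h)))]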
Well, prepare yourself for a whole lot of 'what's wrong with AWStats/Webalizer/Analog-Stats/favourite-http-log-stats-viewer-of-the-month' responses...
They all do histograms, but that's because they are designed to give a broad, at-a-glance picture of visitor traffic.
I recommend that you take a look at Splunk to see if it meets your requirements.
If you don't want to use a histogram, could you just graph the kernel density?
Can nearly concurrent visits be defined or approximated as those that occur in the same second? If yes, here is how I would proceed:
For each second in the data calculate the number of visits. This will include some seconds with 0 visits - don't exclude them.
It is probably reasonable to assume that the number of visits per second has a Poisson distribution with a rate that changes over the day, and perhaps over the week. So decide what the relevant predictors are (time of day, day of the week, month?) and use Poisson regression to model the counts. You can use splines for the continuous variables (e.g. time of day); I believe there are even some "cyclic" splines that can take into account that 11:58 PM is close to 00:02 AM. Or you can cut time into smaller discrete pieces, say 10-minute intervals. If you want to be really fancy, incorporate autocorrelation and overdispersion in the model.
Based on the fitted model, you can estimate whatever percentile you want.
Of course, this is pretty fancy statistically, and you have to know what you are doing, but I think it could work.
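A sketch of this recipe (my own, assuming the mgcv package and a file "hits.txt" with timestamps in the question's format) might look as follows; the cyclic spline is one way to get the 11:58 PM / 00:02 AM behaviour mentioned above:

library(mgcv)

# Per-second counts, keeping the seconds with zero visits
ts     <- as.POSIXct(readLines("hits.txt"), format = "%d.%m.%Y %H:%M:%S")
idx    <- as.integer(difftime(ts, min(ts), units = "secs")) + 1
nsec   <- as.integer(difftime(max(ts), min(ts), units = "secs")) + 1
counts <- tabulate(idx, nbins = nsec)
secs   <- min(ts) + (seq_len(nsec) - 1)

d <- data.frame(
  count = counts,
  tod   = as.numeric(format(secs, "%H")) + as.numeric(format(secs, "%M")) / 60
)
# If the log spans several days, add e.g. weekday = factor(weekdays(secs))
# as a further predictor.

# Poisson regression with a cyclic spline over the 24-hour day
fit <- gam(count ~ s(tod, bs = "cc", k = 10), family = poisson, data = d,
           knots = list(tod = seq(0, 24, length = 10)))

# Fitted visits-per-second rate across the day, plus an upper percentile of
# the per-second count at each time of day
newd   <- data.frame(tod = seq(0, 24, by = 0.25))
lambda <- predict(fit, newd, type = "response")
qpois(0.999, lambda)  # e.g. 99.9th percentile of visits in a single second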
You're right, most of the theory assumes a Poisson distribution of hits, which you don't have because the rate of hits varies with time of day. However, couldn't you stratify your data into, say, one block for each hour of the day and assume that within a single hour the distribution of hits per second/minute/whatever unit is approximately Poisson? There are probably better ways (from a theoretical perspective), but this way has the advantage of being simple to implement and simple to explain to anyone with any statistical background.
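A self-contained sketch of that simpler stratification (same assumed "hits.txt" as above), with one Poisson rate per hour of the day:

# Stratify by hour of day: one Poisson rate (mean visits per second) per hour
ts     <- as.POSIXct(readLines("hits.txt"), format = "%d.%m.%Y %H:%M:%S")
idx    <- as.integer(difftime(ts, min(ts), units = "secs")) + 1
counts <- tabulate(idx, nbins = max(idx))
secs   <- min(ts) + (seq_along(counts) - 1)

hour_of_day    <- as.integer(format(secs, "%H"))
lambda_by_hour <- tapply(counts, hour_of_day, mean)
qpois(0.999, lambda_by_hour)  # per-hour bound on visits in any single second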
I think you could argue that your hits are distributed according to a Poisson distribution whose mean and variance vary with the time of day.
To get a good idea of the peak load, I'd start with just a scatterplot with the time of the hit on the horizontal axis and the time between that hit and the next hit on the vertical axis.
This should give you a good idea of the height and duration of your peaks. Then you can estimate the parameters of the Poisson distribution for a sliding window of a length similar to that duration, for every moment of the day, sort of like a moving average. The areas where the mean and variance are lowest will give you a good basis for estimating the expected future peak load.
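For example (same assumed "hits.txt" as in the earlier sketches), that scatterplot is only a couple of lines in R:

# Time of each hit vs. the gap to the next hit
ts   <- sort(as.POSIXct(readLines("hits.txt"), format = "%d.%m.%Y %H:%M:%S"))
gaps <- diff(as.numeric(ts))                   # seconds until the next hit
plot(ts[-length(ts)], gaps, pch = ".",
     xlab = "time of hit", ylab = "seconds until next hit")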
