Given the results for a simple A / B test...
            A      B
clicked     8     60
ignored   192   1940
(i.e. a conversion rate of 4% for A and 3% for B)
... a Fisher test in R quite rightly says there's no significant difference:
> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...
But what function is available in R to tell me how much I need to increase my sample size to get the p-value down to, say, 0.05?
I could just keep increasing the counts (in their current proportions) until I get there, but there's got to be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?
[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
power.prop.test() should do this for you. In order to get the math to work I converted your 'ignored' data to impressions by summing up your columns.
> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)
Two-sample comparison of proportions power calculation
n = 5300.739
p1 = 0.04
p2 = 0.03
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
That gives 5301, which is for each group, so your sample size needs to be 10600. Subtracting out the 2200 that have already run, you have 8400 "tests" to go.
In this case:
sig.level is the significance threshold your p-value needs to fall below (the 0.05 you asked about).
power is the probability of detecting an effect of the assumed size if it really exists. This is somewhat arbitrary; 80% is a common choice. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size to reach your desired significance level.
If you want to estimate how much longer it will take to reach significance, divide 8400 by the number of impressions per day. That can help determine whether it's worthwhile to continue the test.
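A quick sketch of that arithmetic (the impressions-per-day figure below is purely hypothetical):
impressions_per_day <- 300          # made-up traffic rate for illustration
remaining <- 10600 - 2200           # the 8400 impressions still needed
remaining / impressions_per_day     # ~28 more days at that rate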
You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.
This is a native R function, so you won't need to add or load any packages. Other than that I can't say how similar this is to pwr.2p2n.test().
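As a rough sanity check (my own addition, assuming the pwr package is installed), pwr.2p.test() with Cohen's arcsine effect size lands in the same ballpark:
library(pwr)
pwr.2p.test(h = ES.h(8/200, 60/2000), power = 0.8, sig.level = 0.05)
# n comes out around 5.3k per group, close to power.prop.test()'s 5301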
I have a vector that contains this year's grades, and I need to test a hypothesis using a 1% significance level and the p-value.
I tried something like this:
t.test(length(x[x >= 6])/length(x), mu = 0.5, alternative = "less" )
But this gives the error "not enough 'x' observations". I understand this error, but I have no idea how to solve this differently.
The error comes from the fact that you are not passing the right argument: length(x[x >= 6])/length(x) is a scalar, so you are testing whether a single number equals 0.5, which does not make sense. To use a Student's t-test you have to give a vector as input.
If I understand your objective correctly, you want to test whether the proportion of grades of at least 6 is significantly lower than 50% (hence your alternative = "less"). So you don't need a Student's t-test, but rather a binomial test: you want to count how many grades are at least 6 out of the total number of grades (n). If this is what you are looking for, the following code should answer your problem.
prop.test(x = length(x[x >= 6]), n = length(x), alternative = "less")
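Note (my addition, not part of the original answer): prop.test uses a chi-squared approximation (with continuity correction by default); for the exact binomial test mentioned above you could instead use:
binom.test(x = length(x[x >= 6]), n = length(x), p = 0.5, alternative = "less")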
My R version is 3.5.3 (2019-03-11); have a look at the output below:
> t.test(a$score,a$time,paired=FALSE)
Welch Two Sample t-test
data: a$score and a$time
t = -1.4861, df = 8382, p-value = 0.1373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-20215.279 2781.535
sample estimates:
mean of x mean of y
159.0481 8875.9203
The p-value is 0.1373 (> 0.05), but the means of the two variables are 159.0481 and 8875.9203.
I uploaded the .Rdata file to https://file.io/EH9XV44u
Is anything wrong with my t.test?
I think the title of your question shows the problem here.
The result that a mean of 159 in one set of data is not significantly different from a mean of 8875 in another set of data doesn't mean '159 equals 8875'.
It just means that the data allow so much uncertainty about the 'true' means (of the populations the data were drawn from) that you can't say with confidence that they are different.
Even though - intuitively - you might think that 159 'looks' very different to 8875, we perform a statistical test to verify (or refute) our intuition that this difference couldn't have arisen by chance. In this case, it seems that intuition is wrong.
As Edward & Hong Oui have said in the comments, this is probably because one (or both) of your datasets are very dispersed, so the mean alone doesn't reflect the amount of uncertainty.
An extreme example, which might make this clearer:
data1 <- c(7, 105, 365)    # mean 159
data2 <- c(3, 22, 26600)   # mean 8875
It's clear (to me) that we can't be very confident that data1 and data2 are really different, since the difference in the mean value arises from just a single high value in data2. So, although the means seem very different, we don't expect that this will be significant if we test it.
Indeed:
t.test(data1,data2)
# p-value = 0.4291
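Looking at the spread rather than just the means makes the point (approximate values, using the toy vectors above):
sd(data1)   # ~185
sd(data2)   # ~15350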
I guess that if you look closely at your own data you'll find something similar...
I would like to calculate an effect size between scores from pre-test and post-test of my studies.
However, due to the nature of my research, pre-test scores are usually 0 or almost 0 (before the treatment, participants usually do not have any knowledge in question).
I cannot just use Cohen's d to calculate effect sizes since the pre-test scores do not follow a normal distribution.
Is there any way I can calculate effect sizes in this case?
Any suggestions would be greatly appreciated.
You are looking for Cohen's d to see whether the difference between the two time points (pre- and post-treatment) is large or small. Cohen's d can be calculated as:
(mean_post - mean_pre) / sqrt((var_post + var_pre) / 2)
where var_post and var_pre are the sample variances. Nothing here requires the pre- and post-treatment scores to be normally distributed.
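As a minimal sketch of that formula (my own illustration, assuming pre and post are numeric vectors of scores):
cohens_d <- function(pre, post) {
  pooled_sd <- sqrt((var(post) + var(pre)) / 2)   # root of the averaged variances
  (mean(post) - mean(pre)) / pooled_sd
}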
There are multiple packages available in R that provide a function for Cohen's d: effsize, pwr and lsr. In lsr your R-code would look like this:
library(lsr)
cohensD(pre_test_vector, post_test_vector)
Side note: by the Central Limit Theorem, the distribution of the average score approaches a normal distribution as the sample size grows, so with a large enough sample the averages are approximately normal anyway.
So I am trying to see how close the sample size calculations (for two independent proportions with unequal sample sizes) from PROC POWER in SAS are to those from some sample size functions in R. I am using the data from a UCLA website.
The UCLA site gives parameters as follows:
p1=.3,p2=.15,power=.8,null difference=0, and for the two-sided tests it assumes equal sample sizes;
for the unequal sample size tests the parameters are the same, with group weights of 1 for group1 and 2 for group2, and the tests they perform are one-sided.
I am using the r function
pwr.t.test(n = NULL, d = 0, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "two.sided")
from the pwr package.
So if I input the parameter selections as the UCLA site has for their first example, I get the following error:
Error in uniroot(function(n) eval(p.body) - power, c(2, 1e+07)) :
f() values at end points not of opposite sign.
This appears to be because the difference is undetectable by R: I set d = 0.5 and it ran. Would SAS give an error as well for too small a difference? It doesn't in the example, even though their null difference is zero as well.
I also get the error above when using
pwr.2p.test(h = 0, sig.level = 0.05, power = 0.8)
and
pwr.chisq.test(w = 0, df = 1, sig.level = 0.05, power = 0.8).
I may be doing something horribly wrong, but I can't seem to find a way to do this if the hypothesized difference is 0.
I understand that SAS and R use different methods for calculating power, so I shouldn't expect to get exactly the same result. I am really just trying to see if I can replicate the PROC POWER results in R.
I have been able to get near identical results for the first example with equal sample sizes and a two-sided alternative using
bsamsize(p1=.30,p2=.15,fraction=.5, alpha=.05, power=.8)
from the Hmisc package. But when they do 1-sided tests with unequal sample sizes I can't replicate those.
Is there a way to replicate the one-sided sample size calculations for unequal group sizes in R?
Cheers.
In pwr.t.test and its derivatives, d is not the null difference (that's assumed to be zero), but the effect size/hypothesized difference between the two populations. If the difference between population means is zero, no sample size will let you detect a nonexistent difference.
If population A has a proportion of 15% and population B has a proportion of 30%, then you use the function pwr::ES.h to calculate the effect size and do a test of proportions like:
> pwr.2p.test(h=ES.h(0.30,0.15),power=0.80,sig.level=0.05)
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.3638807
n = 118.5547
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: same sample sizes
> pwr.chisq.test(w=ES.w1(0.3,0.15),df=1,sig.level=0.05,power=0.80)
Chi squared power calculation
w = 0.2738613
N = 104.6515
df = 1
sig.level = 0.05
power = 0.8
NOTE: N is the number of observations
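The pwr package has no single call that solves for n under a fixed allocation ratio, but as a sketch (my own suggestion, not verified against PROC POWER) you can search for the smallest n1 such that pwr.2p2n.test() with n2 = 2*n1 and a one-sided alternative reaches 80% power:
library(pwr)
h <- ES.h(0.30, 0.15)
power_gap <- function(n1) {
  # power of the one-sided test with a 1:2 allocation, minus the target
  pwr.2p2n.test(h = h, n1 = n1, n2 = 2 * n1,
                sig.level = 0.05, alternative = "greater")$power - 0.80
}
n1 <- uniroot(power_gap, c(2, 1e6))$root
ceiling(n1)       # size of group 1
ceiling(2 * n1)   # size of group 2 (twice as large)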
I want to simulate the effect of different kinds of multiple testing corrections, such as Bonferroni, Fisher's LSD, Duncan, Dunn-Sidak, Newman-Keuls, Tukey, etc., on ANOVA.
I guess I should simply run a regular ANOVA and then accept as significant the p-values I adjust using p.adjust. But I don't understand how this p.adjust function works. Could someone give me some insight into p.adjust()?
When running:
> p.adjust(c(0.05,0.05,0.1),"bonferroni")
# [1] 0.15 0.15 0.30
Could someone explain what this output means?
Thank you for your answer. I already know a bit about all that, but I still don't understand the output of p.adjust. I'd expect that...
p.adjust(0.08, 'bonferroni', n = 10)
... would return 0.008 and not 0.8. Doesn't n = 10 mean that I'm doing 10 comparisons? And isn't 0.08 the "original alpha" (I mean the threshold I'd use to reject the null hypothesis if I had one simple comparison)?
You'll have to read about each multiple testing correction technique and whether it controls the False Discovery Rate (FDR) or the Family-Wise Error Rate (FWER). (Thanks to @thelatemail for suggesting the abbreviations be spelled out.)
Bonferroni correction controls the FWER by setting the significance level alpha to alpha/n, where n is the number of hypotheses tested in the multiple comparison (here n = 3).
Let's say you are testing at alpha = 5%, meaning that if a p-value is < 0.05 you reject the null. With n = 3, the Bonferroni correction divides alpha by 3: 0.05/3 ≈ 0.0167, and you then check whether your p-values are < 0.0167.
Equivalently (as is directly evident), instead of checking pval < alpha/n, you can move n to the other side and check pval * n < alpha, so that alpha keeps its original value. Your p-values therefore get multiplied by 3 and are then compared against, for example, alpha = 0.05.
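For instance (my own toy p-values), both checks give the same decisions:
pvals <- c(0.004, 0.02, 0.05, 0.10)
alpha <- 0.05
pvals < alpha / length(pvals)           # compare raw p-values to alpha/n
p.adjust(pvals, "bonferroni") < alpha   # or adjusted p-values to alpha
# both give: TRUE FALSE FALSE FALSE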
So the output you obtain is the FWER-controlled p-value: if it is < alpha (say 5%) you reject the null, otherwise you don't.
Different procedures exist for controlling the false positives that arise from multiple testing; Wikipedia is a good starting point for learning how the other methods do this.
In general, the output of p.adjust is a multiple-testing-corrected p-value: with Bonferroni it is an FWER-controlled p-value, and with the BH method it is an FDR-adjusted p-value (also called a q-value).
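For example (again with made-up values), the same p-values adjusted with both methods:
pv <- c(0.001, 0.01, 0.02, 0.04)
p.adjust(pv, "bonferroni")   # 0.004 0.040 0.080 0.160  (FWER-controlled)
p.adjust(pv, "BH")           # 0.004 0.020 0.027 0.040  (FDR-adjusted, approx.)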
Hope this helps a bit.