I am calculating the sample size for a proportion test. I would like a significance level of 0.05, power of 0.90, and to detect an effect size greater than 5%.
In other words, I would like a statistically significant result if the difference in proportions is more than 5%.
But when I use the pwr.2p.test function from the pwr package to calculate the sample size
pwr.2p.test(sig.level = 0.05, power =0.9, h=0.2, alternative="greater")
I have to specify the effect size as Cohen's d. But its range is said to be (-3, 3), and the interpretation is:
The meaning of effect size varies by context, but the standard interpretation offered by Cohen (1988) is (cited from here):
.8 = large (8/10 of a standard deviation unit)
.5 = moderate (1/2 of a standard deviation)
.2 = small (1/5 of a standard deviation)
My question is: how do I express "I'd like to detect a difference of more than 5% between the proportions in the two groups" as a Cohen's d statistic?
Thanks for any help!
I used the function ES.h from the pwr package. This function calculates the effect size between two proportions. For p1 = 100% and p2 = 95%, we have:
h = ES.h(1, 0.95) = 0.4510268
I understand this effect size expresses the distance between the hypotheses that we want to be able to detect.
I'm not very confident in my interpretation, but I used this value to determine the sample size:
pwr.p.test(h=h, sig.level = 0.05, power = 0.8)
Determining the sample size to detect a difference of up to 5 percentage points in the proportions:
n = 38.58352
To detect a difference of 10 percentage points, the required sample size decreases because less precision is needed. With h = ES.h(1, 0.90) = 0.6435011, we get n = 18.95432.
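Putting the pieces together as one runnable chunk (same numbers as above):
library(pwr)
h5  <- ES.h(1, 0.95)                                # 0.4510268
h10 <- ES.h(1, 0.90)                                # 0.6435011
pwr.p.test(h = h5,  sig.level = 0.05, power = 0.8)  # n = 38.58352
pwr.p.test(h = h10, sig.level = 0.05, power = 0.8)  # n = 18.95432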
That is my interpretation. What do you think? Am I right?
I am flipping each of the 50 coins in a bag 100 times, and then I want to use the method of maximum statistics to determine the family-wise error rate (FWER). However, I keep getting an FWER of 1, which feels wrong.
coins <- rbinom(50, 100, 0.5)
So I start by defining a function whose inputs are the number of randomizations to perform, the coins themselves, and the number of times we flip them.
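(The code relies on map_df() from purrr and tidy() from broom, so I load those first.)
library(purrr)  # map_df()
library(broom)  # tidy()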
simulate_max <- function(n_number_of_randomizations, input_coins, N_number_of_tosses, alpha = 0.05) {
maxList <- NULL
Then we run a for loop for the number of randomizations we specified.
for (iteration in 1:n_number_of_randomizations){
Now we shuffle the list of coins
CoinIteration <- sample(input_coins)
Now we apply the binomial test to every coin in the bag.
testresults <- map_df(CoinIteration, function(x) tidy(binom.test(x,N_number_of_tosses,p=alpha)) )
Now we want to add the maximum result from every test to the max list.
thisRandMax <- max(testresults$statistic)
maxList <- c(maxList, thisRandMax)
}
Finally, we iterate through every member of the maximum list and subtract the expected number of heads (i.e., 50, for a 50% chance times 100 tosses).
for (iterator2 in 1:length(maxList)){
maxList[iterator2]<-maxList[iterator2]-(0.5*N_number_of_tosses)
}
Return the output from the function
return(data.frame(maxList))
}
Now we apply this simulation for each of the requested iterations.
repsmax = map_df(1:Nreps, ~simulate_max(Nrandomizations,coins,Ntosses))
Now we calculate the FWER by dividing the number of maxima that exceed the expected value by the total number of cells.
fwer = sum(repsmax>0) / (Nreps*Nrandomizations)
There are some issues that I think would be good to clarify.
An FWER of ~1 seems about right to me given the parameters of your experiment. The FWER is the probability of at least one Type I error across the family of tests: for a single test at alpha = 0.05, FWER = 1 - P(no Type I errors) = 1 - 0.95 = 0.05. For two tests at alpha = 0.05, FWER = 1 - 0.95^2 = 0.0975. You have 50 coins (50 tests), so your FWER at alpha = 0.05 is 1 - 0.95^50 = 0.923. If your code instead treats the 100 tosses as 100 tests, the FWER rises to 1 - 0.95^100 = ~0.994 (~1).
You can control the Type I error (account for multiple testing) by using, e.g., the Bonferroni correction (alpha / n). If you change your alpha to 0.05 / 50 = 0.001, you will control (reduce) your FWER to about 0.05 (1 - 0.999^50 = ~0.049). I suspect this is the answer you are looking for: with alpha = 0.001 the FWER is about 0.05 and you have an acceptable chance of incorrectly rejecting a null hypothesis.
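Just to make the arithmetic concrete, here is the same calculation in R (plain algebra, nothing specific to your simulation):
alpha   <- 0.05
n_tests <- 50
1 - (1 - alpha)^n_tests          # uncorrected FWER, ~0.923
alpha_bonf <- alpha / n_tests    # Bonferroni-adjusted per-test alpha, 0.001
1 - (1 - alpha_bonf)^n_tests     # corrected FWER, ~0.049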
I don't know what the "maximum estimate of the effect size" is, or how to calculate it, but given that the two distributions are approximately identical, the effect size will be ~0. It then makes sense that controlling the FWER at 0.05 (by adjusting alpha to 0.001) is the 'answer' to the question, and if you can get your code to reflect that logic, I think you'll have your solution.
It is the case that the probability density for a standardized and unstandardized random variable will differ. E.g., in R
dnorm(x = 0, mean = 1, sd = 2)
dnorm(x = (0 - 1)/2)
However,
pnorm(q = 0, mean = 1, sd = 2)
pnorm(q = (0 - 1)/2)
yield the same value.
Are there any situations in which the normal cumulative distribution function will yield a different probability for the same random variable when it is standardized versus unstandardized? If yes, is there a particular example in which this difference arises? If not, is there a general proof of this property?
Thanks so much for any help and/or insight!
This isn't really a coding question, but I'll answer it anyway.
Short answer: the densities may differ, but the cumulative probabilities will not.
Long answer:
A normal distribution is usually thought of as y = f(x), that is, a curve over the domain of x. When you standardize, you convert from units of x to units of z. For example, if x ~ N(15, 5^2), then a value of 10 is 5 x-units less than the mean. Notice that this is also 1 standard deviation less than the mean. When you standardize, you convert x to z ~ N(0, 1^2). Now, that example value of 10, when standardized into z-units, becomes a value of -1 (i.e., it is still one standard deviation less than the mean).
As a result, the area under the curve to the left of x=10 is the same as the area under the curve to the left of z=-1. In words, the cumulative probability up to those cut-offs is the same.
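A quick numerical check of that equality:
pnorm(10, mean = 15, sd = 5)   # 0.1586553
pnorm(-1)                      # 0.1586553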
However, the heights of the curves are different. Let the normal density curves be f(x) and g(z). Then f(10) != g(-1). In code:
dnorm(10, 15, 5) != dnorm(-1, 0, 1)
The reason is that the act of standardizing either "spreads" or "squishes" the f(x) curve to make it "fit" over the new z domain as g(z).
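For the same example numbers, the two heights differ by exactly the standardization factor 1/sd:
dnorm(10, mean = 15, sd = 5)                              # 0.04839414
dnorm(-1)                                                 # 0.2419707
all.equal(dnorm(10, mean = 15, sd = 5), dnorm(-1) / 5)    # TRUE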
Here are two links that let you visualize the spreading/squishing:
https://academo.org/demos/gaussian-distribution/
https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
Hope this helps!
The following link gave me a better understanding of incorporating ordinary costs into my binary classification model: https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
With a standard classifier, the default threshold is usually 0.5, and the aim is to minimize the total number of misclassification errors (i.e., obtain the maximum accuracy). However, all misclassification errors are treated equally. This is not typically the case in a real-world setting, since the cost of a false negative may be much greater than that of a false positive.
Using empirical thresholding, I was able to obtain the optimal threshold value for classifying an instance as good or bad while minimizing the average cost. On the other hand, this comes at the price of reducing the accuracy and other performance measures. This is illustrated in the following figure:
In the figure above, the red line denotes the standard threshold of 0.5 which maximizes accuracy but gives a sub-optimal average credit cost. The blue line denotes the desired threshold that minimizes the cost, but now the accuracy is drastically reduced.
Generally, I would not be concerned about the reduced accuracy. Suppose, however, there is also an incentive not only to minimize the cost but also to maximize the precision. Note that the precision is the positive predictive value, ppv = TP/(TP+FP). Then the green line might be a good trade-off that gives a relatively low cost and a relatively high ppv. Here, I plotted the green line as the average of the red and blue lines (both the credit-cost and ppv functions seem to have about the same gradient between these regions, so calculating the optimal threshold this way probably provides a good estimate), but is there a way to calculate this threshold exactly?
My thoughts are to create a new performance measure as a function of both the costs and the ppv, and then minimize this performance measure.
Example: measure = credit.costs*(-ppv)
But I'm not sure how to code this in R (a rough sketch of one idea is shown after the code below). Any advice on what should be done would be greatly appreciated.
My R code is as follows:
library(mlr)
library(ggplot2)  # for the geom_vline() layers added to the plot below
## Load dataset
data(GermanCredit, package = "caret")
credit.task = makeClassifTask(data = GermanCredit, target = "Class")
## Removing 2 columns: Purpose.Vacation,Personal.Female.Single
credit.task = removeConstantFeatures(credit.task)
## Generate cost matrix
costs = matrix(c(0, 1, 5, 0), 2)
colnames(costs) = rownames(costs) = getTaskClassLevels(credit.task)
## Make cost measure
credit.costs = makeCostMeasure(id = "credit.costs", name = "Credit costs", costs = costs, best = 0, worst = 5)
## Set training scheme with repeated 10-fold cross-validation
set.seed(100)
rin = makeResampleInstance("RepCV", folds = 10, reps = 3, task = credit.task)
## Fit a logistic regression model (nnet::multinom())
lrn = makeLearner("classif.multinom", predict.type = "prob", trace = FALSE)
r = resample(lrn, credit.task, resampling = rin, measures = list(credit.costs, mmce), show.info = FALSE)
r
# Tune the threshold using average costs based on the predicted probabilities on the 3 test data sets
cost_tune.res = tuneThreshold(pred = r$pred, measure = credit.costs)
# Tune the threshold using precision based on the predicted probabilities on the 3 test data sets
ppv_tune.res = tuneThreshold(pred = r$pred, measure = ppv)
d = generateThreshVsPerfData(r, measures = list(credit.costs, ppv, acc))
plt = plotThreshVsPerf(d)
plt + geom_vline(xintercept=cost_tune.res$th, colour = "blue") + geom_vline(xintercept=0.5, colour = "red") +
geom_vline(xintercept=1/2*(cost_tune.res$th + 0.5), colour = "green")
calculateConfusionMatrix(r$pred)
performance(r$pred, measures = list(acc, ppv, credit.costs))
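Here is a rough sketch of the kind of thing I have in mind for the combined measure suggested above. I'm assuming the $data slot of the generateThreshVsPerfData() result holds a threshold column plus one column per measure id (credit.costs, ppv); please correct me if that's wrong:
perf <- d$data
perf$combined <- perf$credit.costs * (-perf$ppv)       # the example measure suggested above
best_th <- perf$threshold[which.min(perf$combined)]    # threshold minimizing the combined measure
best_th
Another route might be wrapping the same idea in a custom measure (mlr's makeMeasure) and passing that to tuneThreshold, but the grid above is easier to inspect.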
Finally, I'm also a bit confused about my ppv value. When I look at my confusion matrix, I calculate the ppv as 442/(442+289) = 0.6046512, but the reported value is slightly different (0.6053531). Is there something wrong with my calculation?
An exam has 20 multiple-choice questions, each answered correctly with probability P = 0.25. How do I simulate a class of 100 students, and what is the average score for the class? If the class size is increased to 1000, what happens to the average?
I'm not sure where to begin, other than just trying to solve this manually.
n_experiments<-100
n_samples<-c(1:20)
means_of_sample_n<-c()
hist(rbinom( n = 100, size = 20, prob = 0.25 ))
I'm not sure what to do after this.
Well, you just have to find a way to set the probability of answering each question correctly to 0.25; you can do that easily by generating uniform random numbers:
## Simulating exam scores directly from the binomial distribution:
hist(rbinom(n = 100000, size = 20, prob = 0.25))
## Or building it up from uniform draws, one row per student and one column
## per question (a cell is 1 if that question was answered correctly):
Nstu <- 100000
Nquest <- 20
Results <- matrix(as.numeric(runif(Nstu * Nquest) < 0.25), ncol = Nquest)
hist(apply(Results, 1, sum))    # distribution of total scores
mean(apply(Results, 1, sum))    # average score, close to 20 * 0.25 = 5
from the definition of the binomial distribution:
the mean is defined to be n*p, so mu = 20*0.25, giving a mean of 5. this is independent of the class size
the variance is defined to be n*p*(1-p), and the standard deviation is the usual sqrt of this, so sigma = sqrt(20*0.25*0.75), i.e. ~1.94.
the standard error of the mean is sigma / sqrt(k), where k would be your class size. so we get SEMs of 0.19 and 0.061 for class sizes of 100 and 1000 respectively
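the same quantities computed in R, just to have them in one place:
mu    <- 20 * 0.25                 # 5
sigma <- sqrt(20 * 0.25 * 0.75)    # ~1.94
sigma / sqrt(c(100, 1000))         # SEMs of ~0.194 and ~0.061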
it's often useful to check things via simulation, and we can simulate a single class as you were doing.
x <- rbinom(100, 20, 0.25)
plot(table(x))
I'm using plot(table(x)) above instead of using hist, because this is a discrete distribution. hist is more suited to continuous distributions, while table is better for discrete distributions with a small number of distinct values.
next, we can simulate things many times using replicate. in this case you're after the mean of the binomial draw:
y <- replicate(1000, mean(rbinom(100, 20, 0.25)))
c(mu=mean(y), se=sd(y))
which happened to give me mu=5.002 and se=0.201, but will change every run. increasing the class size to 1000, I get mu=5.002 again, and se=0.060. because these are random samples from the distribution they are subject to "monte-carlo error" but given enough replicates they should approach the analytical answers above. that said, they're close enough to the analytical results to give me confidence I've not made any silly typos
tmpSD = 10
tmpSD2 = 20
power.t.test(n=10,delta = 1*tmpSD,sd = tmpSD,sig.level=0.05,power=NULL,type="two.sample", alternative = c("two.sided"))
power.t.test(n=10,delta = 1*tmpSD2,sd = tmpSD2,sig.level=0.05,power=NULL,type="two.sample", alternative = c("two.sided"))
I have the above code in my R program. Both the first and the second power.t.test result in the same power of 0.5619846. Does this mean that if the ratio of delta to standard deviation stays the same, so will my power?
EDIT:
In my test, I am running a power analysis to determine the minimum n needed to have 80% power to find a statistically significant difference in Contribution and in Time. The standard deviations of the two are different. But when running the following for loop to determine the power of each at different n's, I obtain the exact same power at each n. I am confused as to why the power of each is identical at the same n. Why should my power to detect a statistically significant difference in decision time at a particular n be equal to my power to detect a statistically significant difference in contribution amount at the same n?
sdContribution #15.39155
sdTime #22.95667
for (i in seq(20, 200, by = 10)) {
  print(power.t.test(n = i, delta = 0.05 * sdContribution, sd = sdContribution,
                     sig.level = 0.05, power = NULL, type = "two.sample",
                     alternative = c("two.sided")))
  print(power.t.test(n = i, delta = 0.05 * sdTime, sd = sdTime,
                     sig.level = 0.05, power = NULL, type = "two.sample",
                     alternative = c("two.sided")))
}
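A stripped-down example of what I mean: both calls below share delta / sd = 0.05, and they report identical power (n = 50 chosen arbitrarily from the range above).
power.t.test(n = 50, delta = 0.05 * 15.39155, sd = 15.39155, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$power
power.t.test(n = 50, delta = 0.05 * 22.95667, sd = 22.95667, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$power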