I'm looking for a built-in R function that calculates the power of a one sample hypothesis test for proportions.
The built in function power.prop.test only does TWO SAMPLE hypothesis tests for proportions.
The original question is: "How many times do you have to toss a coin to determine that it is biased?"
p.null <- 0.5 # null hypothesis.
We say that a coin is "biased" if the probability of tossing heads is either
greater than 0.51 or less than 0.49. Otherwise we say that it is "good enough"
delta <- 0.01
Here is a function to toss a biased coin N times and return the proportion of heads:
biased.coin <- function(delta, N) {
probs <- runif(N, 0, 1)
heads <- probs[probs < 0.5+delta]
return(length(heads)/N)
}
We fix alpha and beta throughout at the standard values. Our goal is to calculate N.
alpha = 0.05 # significance level (95% confidence)
beta = 0.8 # target power: correctly reject the null hypothesis 80% of the time
The first step is to use a simulation.
A single experiment is to toss the coin N times and reject the null hypothesis if the number of heads deviates "too far" from the expected value of N/2
We then repeat the experiment M times and count how many times the null hypothesis is (correctly) rejected.
M <- 1000
simulate.power <- function(delta, N, p.null, M, alpha) {
print(paste("Calculating power for N =", N))
reject <- c()
se <- sqrt(p.null*(1-p.null))/sqrt(N)
for (i in (1:M)) {
heads <- biased.coin(delta, N) # perform an experiment
z <- (heads - p.null)/se # z-score
p.value <- pnorm(-abs(z)) # one-tailed p-value
reject[i] <- p.value < alpha/2 # Do we reject the null? (two-sided test)
}
return(sum(reject)/M) # proportion of time null was rejected.
}
Next we plot a graph (slow, about 5 minutes):
ns <- seq(1000, 50000, by=1000)
my.pwr <- c()
for (i in (1:length(ns))) {
my.pwr[i] <- simulate.power(delta, ns[i], p.null, M, alpha)
}
plot(ns, my.pwr)
From the graph it looks like the N you need for a power of beta = 0.8 is about 20000.
The simulation is very slow so it would be nice to have a built in function.
A little fiddling around gave me this:
magic <- function(p.null, delta, alpha, N) {
magic <-power.prop.test(p1=p.null,
p2=p.null+delta,
sig.level=alpha,
###################################
n=2*N, # mysterious 2
###################################
alternative="two.sided",
strict=FALSE)
return(magic[["power"]])
}
Let's plot it against our simulated data.
pwr.magic <- c()
for (i in (1:length(ns))) {
pwr.magic[i] <- magic(p.null, delta, alpha, ns[i])
}
points(ns, pwr.magic, pch=20)
The fit is good, but I have no idea why I would need to multiply N by two,
in order to get a one sample power out of a two sample proportion test.
It would be nice if there were a built in function that let you do one sample directly.
Thanks!
You could try
library(pwr)
h <- ES.h(0.51, 0.5) # Compute effect size h for two proportions
pwr.p.test(h = h, n = NULL, sig.level = 0.05, power = 0.8, alternative = "two.sided")
# proportion power calculation for binomial distribution (arcsine transformation)
# h = 0.02000133
# n = 19619.53
# sig.level = 0.05
# power = 0.8
# alternative = two.sided
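As for the "mysterious 2", here is a sketch of why it appears under the normal approximation. The difference of two independent sample proportions, each based on n observations, has variance of roughly 2p(1-p)/n, whereas a single sample proportion based on N observations has variance p(1-p)/N; the two agree when n = 2N. The helper power.one.prop below is hypothetical (it is not part of base R or pwr) and simply writes out the one-sample z-test power so it can be compared with power.prop.test.

```r
# Hypothetical helper (not in base R or pwr): normal-approximation power
# of a two-sided one-sample z-test for a proportion.
power.one.prop <- function(p.null, delta, alpha, N) {
  p.alt <- p.null + delta
  se0 <- sqrt(p.null * (1 - p.null) / N)  # standard error under the null
  se1 <- sqrt(p.alt * (1 - p.alt) / N)    # standard error under the alternative
  crit <- qnorm(1 - alpha / 2) * se0      # rejection threshold for |p.hat - p.null|
  # P(p.hat > p.null + crit) + P(p.hat < p.null - crit) under the alternative
  pnorm((delta - crit) / se1) + pnorm((-delta - crit) / se1)
}

# One-sample power at N = 20000
power.one.prop(0.5, 0.01, 0.05, 20000)

# Two-sample power with n = 2N per group, for comparison
power.prop.test(p1 = 0.5, p2 = 0.51, n = 2 * 20000, sig.level = 0.05)$power
```

The two numbers should agree closely, which is consistent with the fit seen in the plot.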
As an aside, one way to speed up your simulation significantly would be to use rbinom instead of runif:
biased.coin2 <- function(delta, N) {
rbinom(1, N, 0.5 + delta) / N
}
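Taking the rbinom idea one step further, all M experiments can be drawn in a single call, which removes the inner loop entirely. This is only a sketch: simulate.power2 is a hypothetical name, and it uses the same z-test with the null standard error as simulate.power in the question.

```r
# Hypothetical vectorised version of simulate.power (name is an assumption)
simulate.power2 <- function(delta, N, p.null = 0.5, M = 1000, alpha = 0.05) {
  se <- sqrt(p.null * (1 - p.null) / N)        # standard error under the null
  heads <- rbinom(M, N, p.null + delta) / N    # M proportions of heads at once
  z <- (heads - p.null) / se                   # z-score for each experiment
  mean(2 * pnorm(-abs(z)) < alpha)             # share of two-sided rejections
}

set.seed(1)
simulate.power2(delta = 0.01, N = 20000)
```

Because each call is now cheap, M can be raised well above 1000 to smooth out the simulated power curve.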
So I have this probability distribution for the repair cost X:
X = 0 with probability 7/8
X ~ Exp(rate = 1/60) with probability 1/8
James's car breaks down N times a year, where N ~ Pois(2); X is the cost of a single repair and Y is the total repair cost James incurs in a year.
I want to calculate E[Y] and V(Y), which should give me E[Y] = 15 and V(Y) = 1800.
I have this monte Carlo simulation:
expon_dis <- rexp(200, 1/60)
result_matrix2 <- rep(0, 200)
expected_matrix <- rep(0, runs)
for (u in 1:runs){
expon_dis <- rexp(200, 1/60)
N <- rpois(200, 2)
for (l in 1:200){
result_matrix2[l] <- (expon_dis[l] * (1/8)) * (N[l])
}
expected_matrix[u] <- mean(result_matrix2)
}
This code gives the expected value of 15 but the variance is not correct. So what is wrong with this simulation?
Not enough time to read through your code, but I think the error comes from the multiplication.
Below is a very rough implementation, where first you write a function to simulate the cost, given x number of breakdowns:
sim_cost = function(x){
cost = rexp(x,1/60)
prob = sample(c(0,1/60),x,prob=c(7/8,1/8),replace=TRUE)
sum(cost[prob>0])
}
Then generate the number of breakdowns per year:
set.seed(111)
N <- rpois(500000, 2)
Iterate over the years, if 0, we return 0:
set.seed(111)
sim = sapply(N,function(i)if(i==0){0}else{sum(sim_cost(i))})
mean(sim)
[1] 14.98248
var(sim)
[1] 1797.549
You need quite a large number of simulations, but the code above should be a starting point that you can optimize to get closer.
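As a cross-check on the target values, the compound Poisson identities E[Y] = lambda * E[X] and Var(Y) = lambda * E[X^2] can be evaluated directly, assuming X is 0 with probability 7/8 and exponential with mean 60 otherwise. The sketch below also simulates all years without an explicit loop; the variable names are assumptions of mine, not taken from the question.

```r
lambda <- 2      # mean number of breakdowns per year
p.cost <- 1/8    # probability that a breakdown has a non-zero cost
rate   <- 1/60   # rate of the exponential cost

# Analytic values
EX  <- p.cost / rate          # E[X]   = (1/8) * 60
EX2 <- p.cost * 2 / rate^2    # E[X^2] = (1/8) * 2 * 60^2
c(EY = lambda * EX, VY = lambda * EX2)

# Vectorised simulation
set.seed(111)
n.years <- 1e5
N <- rpois(n.years, lambda)        # breakdowns per year
K <- rbinom(n.years, N, p.cost)    # breakdowns with a non-zero cost
Y <- numeric(n.years)              # years with K = 0 keep a total cost of 0
pos <- K > 0
# the sum of K independent Exp(rate) costs is Gamma(shape = K, rate)
Y[pos] <- rgamma(sum(pos), shape = K[pos], rate = rate)
c(mean = mean(Y), var = var(Y))
```

The sample variance converges slowly here because Y is heavily right-skewed, so a large number of simulated years is needed before it settles near 1800.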
Searching through Stack Overflow for similar questions, I notice that this has been asked several times but hasn't really been properly answered. Perhaps with help from other users this post can be a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.
I know, I know! The closed form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose and I don't expect an exact analytic solution so this is a test case to check the procedure.
So here is my attempt. Please comment. Especially if I am missing opportunities for optimization. Note, I'm not a statistician so I'd appreciate any pointers.
ll_multN <- function(theta,X) {
# theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
# X is an nxk dataset
# log-likelihood: L = - (nk/2)*log(2*pi) - (n/2)*log(det(Sigma)) - (1/2)*sum_i(t(X_i-mu) %*% Sigma^-1 %*% (X_i-mu))
# summation over i is performed using an apply call for efficiency
n <- nrow(X)
k <- ncol(X)
# def mu
mu.vec <- theta[1:k]
# def Sigma
Sigma.diag <- theta[(k+1):(2*k)]
Sigma.offd <- theta[(2*k+1):length(theta)]
Sigma <- matrix(NA, k, k)
Sigma[upper.tri(Sigma)] <- Sigma.offd
Sigma <- t(Sigma)
Sigma[upper.tri(Sigma)] <- Sigma.offd
diag(Sigma) <- Sigma.diag
# compute summation
sum_i <- sum(apply(X, 1, function(x) (matrix(x,1,k)-mu.vec)%*%solve(Sigma)%*%t(matrix(x,1,k)-mu.vec)))
# compute log likelihood
logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma))
logl <- logl - .5*sum_i
return(-logl)
}
Simulated dataset generated using the rmvnorm() function in the package "mvtnorm". Random positive definite covariance matrix generated using the additional function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708)
library(mvtnorm)
Posdef <- function (n, ev = runif(n, 0, 5)) {
# generates a random positive definite covariance matrix
Z <- matrix(ncol=n, rnorm(n^2))
decomp <- qr(Z)
Q <- qr.Q(decomp)
R <- qr.R(decomp)
d <- diag(R)
ph <- d / abs(d)
O <- Q %*% diag(ph)
Z <- t(O) %*% diag(ev) %*% O
return(Z)
}
set.seed(2)
n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=T) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$val # check positive def (all lambda > 0)
# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)
# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))
# limits for optimization algorithm
eps <- .Machine$double.eps # a small value for bounding the parameter space, to avoid things such as log(0)
n.offd <- k*(k-1)/2 # number of off-diagonal parameters sigma_ij
lower.bound <- c(rep(-Inf,k), # bound on mu
rep(eps,k), # bound on sigma_ii
rep(-Inf,n.offd)) # bound on sigma_ij i=/=j
upper.bound <- c(rep(Inf,k), # bound on mu
rep(100,k), # bound on sigma_ii
rep(100,n.offd)) # bound on sigma_ij i=/=j
system.time(
o <- optim(pars.init,
ll_multN, X=X, method="L-BFGS-B",
lower = lower.bound,
upper = upper.bound)
)
plot(x=c(mu.tru,diag(Sigma.tru),Sigma.tru[upper.tri(Sigma.tru)]),
y=o$par,
xlab="Parameter",
ylab="Estimate",
pch=20)
abline(c(0,1), col="red", lty=2)
This currently runs on my laptop in
user system elapsed
47.852 24.014 24.611
and gives this graphical output:
[Figure: estimated mean and variance]
In particular any advice on limit setting or algorithm choice would be much appreciated.
Thanks
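One possible speed-up, offered as a sketch rather than a definitive answer: the apply call inverts Sigma once per row of X, so most of the time is likely spent in solve(). The hypothetical function ll_multN_fast below keeps the same theta layout but computes all quadratic forms in one call to mahalanobis() (which inverts the covariance once) and takes the log-determinant directly. It does not check that Sigma is positive definite.

```r
# Hypothetical faster variant of ll_multN (same theta layout as above)
ll_multN_fast <- function(theta, X) {
  n <- nrow(X)
  k <- ncol(X)
  mu.vec <- theta[1:k]
  # rebuild the symmetric covariance matrix from the parameter vector
  Sigma <- matrix(0, k, k)
  Sigma[upper.tri(Sigma)] <- theta[(2 * k + 1):length(theta)]
  Sigma <- Sigma + t(Sigma)
  diag(Sigma) <- theta[(k + 1):(2 * k)]
  # all n quadratic forms at once; Sigma is inverted only once
  sum_i <- sum(mahalanobis(X, center = mu.vec, cov = Sigma))
  # log|det(Sigma)| computed on the log scale
  logdet <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  # negative log-likelihood, as expected by optim()
  0.5 * n * k * log(2 * pi) + 0.5 * n * logdet + 0.5 * sum_i
}
```

It can be passed to optim in place of ll_multN with the same starting values and bounds.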
I tried my luck at coding a rejection sampling method to generate a sample that follows a normal distribution. The samples look like normal distributions at first glance, but the p-value of the Shapiro-Wilk test is always < 0.05. I don't really know where I went wrong, and I only got the pseudo-code from my teacher (it's NOT homework). Any help is appreciated. Below is my code:
f <- function(x,m,v) { #target distribution, m=mean,v=variance
dnorm(x,m,sqrt(v))
}
g <- function(x,x0,lambda) { #cauchy distribution for sampling
dcauchy(x,x0,lambda)
}
genSamp <- function(n,m,v) { #I want the user to be able to choose mean and sd
#and size of the sample
stProbe <- rep(0,n) #the sample vector
interval = c(m-10*sqrt(v),m+10*sqrt(v)) #wanted to go sure that everything
#is covered, so I took a range
#that depends on the mean
M = max(f(interval,m,v)/g(interval,m,v)) #rescaling coefficient, so the cauchy distribution
#is never under the normal distribution
#I chose x0 = m and lambda = v, so the cauchy distribution is close to
#the target normal distribution
for (i in 1:n) {
repeat{
x <- rcauchy(1,m,v)
u <- runif(1,0,max(f(interval,m,v)))
if(u < (f(x,m,v)/(M*g(x,m,v)))) {
break
}
}
stProbe[i] <- x
}
return(stProbe)
}
Then I tried it out with:
test <- genSamp(100,2,0.5)
hist(test,prob=T,breaks=30)#looked not bad
shapiro.test(test) #p-value way below 0.05
Thank you in advance for your help.
Actually, the first thing I checked is sample mean and sample variance. When I draw 1000 samples with your genSamp, I get sample mean at 2, but sample variance at about 2.64, far from the target 0.5.
The 1st problem is with your computation of M. Note that:
interval = c(m - 10 * sqrt(v), m + 10 * sqrt(v))
only gives you 2 values, rather than a grid of equally spaced points on the interval. At 10 standard deviations away from the mean, the Normal density is almost 0, so M is almost 0. You need to do something like
interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
The 2nd problem is the generation of uniform random variable in your repeat. Why do you do
u <- runif(1,0,max(f(interval,m,v)))
You want
u <- runif(1, 0, 1)
With these fixes, I have tested that genSamp gets the correct sample mean and sample variance. The samples pass both Shapiro–Wilk test and Kolmogorov-Smirnov test (?ks.test).
Full working code
f <- function(x,m,v) dnorm(x,m,sqrt(v))
g <- function(x,x0,lambda) dcauchy(x,x0,lambda)
genSamp <- function(n,m,v) {
stProbe <- rep(0,n)
interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
M = max(f(interval,m,v)/g(interval,m,v))
for (i in 1:n) {
repeat{
x <- rcauchy(1,m,v)
u <- runif(1,0,1)
if(u < (f(x,m,v)/(M*g(x,m,v)))) break
}
stProbe[i] <- x
}
return(stProbe)
}
set.seed(0)
test <- genSamp(1000, 2, 0.5)
shapiro.test(test)$p.value
#[1] 0.1563038
ks.test(test, rnorm(1000, 2, sqrt(0.5)))$p.value
#[1] 0.7590978
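If speed matters, the same accept/reject rule can be applied to whole batches of proposals instead of one draw at a time. The following is a minimal sketch under the same assumptions as the code above (Cauchy proposal with location m and scale v, M taken from a grid); genSamp.vec is a hypothetical name.

```r
# Hypothetical batched version of genSamp (name is an assumption)
genSamp.vec <- function(n, m, v) {
  # ratio of target density to proposal density
  ratio <- function(x) dnorm(x, m, sqrt(v)) / dcauchy(x, m, v)
  grid <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
  M <- max(ratio(grid))             # envelope constant from the grid
  out <- numeric(0)
  while (length(out) < n) {
    x <- rcauchy(2 * n, m, v)       # a batch of proposals
    u <- runif(2 * n)               # uniform(0, 1) draws for the accept step
    out <- c(out, x[u < ratio(x) / M])
  }
  out[1:n]                          # keep exactly n accepted draws
}

set.seed(0)
test2 <- genSamp.vec(1000, 2, 0.5)
c(mean = mean(test2), var = var(test2))
```

The sample mean and variance should land near 2 and 0.5, as with the loop version.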
You have
f <- function(x,m,v) { #target distribution, m=mean,v=variance
dnorm(x,e,sqrt(v))
}
which samples with mean e, but that is never defined.
I was wondering how I could check via simulation in R that the 95% Confidence Interval obtained from a binomial test with 5 successes in 15 trials when TRUE p = .5 has a 95% "Coverage Probability" in the long-run?
Here is the 95% CI for such a test using R (how can I show that the following CI has 95% coverage in the long run if the TRUE p = .5?):
as.numeric(binom.test(x = 5, n = 15, p = .5)[[4]])
# > [1] 0.1182411 0.6161963 (in the long-run 95% of the time, ".5" is contained within these
# two numbers, how to show this in R?)
Something like this?
fun <- function(n = 15, p = 0.5){
x <- rbinom(1, size = n, prob = p)
res <- binom.test(x, n, p)[[4]]
c(Lower = res[1], Upper = res[2])
}
set.seed(3183)
R <- 10000
sim <- t(replicate(R, fun()))
Note that binom.test when called with 5 successes, 15 trials and p = 0.5 will always return the same value, hence the call to rbinom. The number of successes will vary. We can compute the proportion of cases when p is between Lower and Upper.
cov <- mean(sim[,1] <= .5 & .5 <= sim[,2])
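The simulated proportion can be compared with the exact coverage, since for fixed n there are only n + 1 possible outcomes. The sketch below sums the binomial probabilities of every outcome x whose interval contains p; the names covers and exact.cov are assumptions of mine.

```r
n <- 15
p <- 0.5

# For each possible number of successes, does the 95% CI contain p?
covers <- sapply(0:n, function(x) {
  ci <- binom.test(x, n)$conf.int
  ci[1] <= p && p <= ci[2]
})

# Exact coverage: total probability of the outcomes whose CI covers p
exact.cov <- sum(dbinom(0:n, n, p)[covers])
exact.cov
```

For n = 15 and p = 0.5 this comes out at roughly 0.965, above the nominal 0.95, because the Clopper-Pearson interval used by binom.test guarantees at least the stated confidence level rather than exactly that level.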
The excerpt below is from "Permutation, Parametric and Bootstrap Tests of Hypotheses", Third Ed. by Phillip Good (pages 58-61), section 3.7.2.
I am trying to implement this permutation test in R (see further below) to compare two variances. I am now thinking about how to calculate the p-value, and whether the test allows for different alternative hypotheses (greater, less, two-sided), and I am not sure how to proceed.
Could you shed some light on this and perhaps give me some criticism about the code? Many thanks!
# Aly's non-parametric, permutation test of equality of variances
# From "Permutation, Parametric and Bootstrap Tests of Hypotheses", Third Ed.
# by Phillip Good (pages 58-61), section 3.7.2.
# Implementation of delta statistic as defined by formula in page 60
# x_{i}, order statistics
# z = x_{i+1} - x_{i}, differences between successive order statistics
aly_delta_statistic <- function(z) {
z_length <- length(z)
m <- z_length + 1
i <- 1:z_length
sum(i*(m-i)*z)
}
aly_test_statistic <- function(sample1, sample2 = NULL, nperm = 1) {
# compute statistic based on one sample only: sample1
if(is.null(sample2)) {
sample1 <- sort(sample1)
z <- diff(sample1)
return(aly_delta_statistic(z))
}
# statistic based on randomization of the two samples
else {
m1 <- length(sample1)
m2 <- length(sample2)
# allocate a vector to save the statistic delta
statistic <- vector(mode = "numeric", length = nperm)
for(j in 1:nperm) {
# 1st stage resampling (performed only if samples sizes are different)
# larger sample is resized to the size of the smaller
if(m2 > m1) {
sample2 <- sort(sample(sample2, m1))
m <- m1
} else {
sample1 <- sort(sample(sample1, m2))
m <- m2
}
# z-values: z1 in column 1 and z2 in column 2.
z_two_samples <- matrix(c(diff(sample1), diff(sample2)), ncol = 2)
# 2nd stage resampling
z <- apply(z_two_samples, 1, sample, 1)
statistic[j] <- aly_delta_statistic(z)
}
return(statistic)
}
}