one sample hypothesis test for proportions - r

I'm looking for a built-in R function that calculates the power of a one sample hypothesis test for proportions.
The built in function power.prop.test only does TWO SAMPLE hypothesis tests for proportions.
The original question is: "How many times do you have to toss a coin to determine that it is biased?"
p.null <- 0.5 # null hypothesis.
We say that a coin is "biased" if the probability of tossing heads is either
greater than 0.51 or less than 0.49. Otherwise we say that it is "good enough".
delta <- 0.01
Here is a function to toss a biased coin N times and return the proportion of heads:
biased.coin <- function(delta, N) {
probs <- runif(N, 0, 1)
heads <- probs[probs < 0.5+delta]
return(length(heads)/N)
}
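We fix alpha and the target power (called beta here) throughout at the standard values. Our goal is to calculate N.
alpha = 0.05 # significance level (5%, i.e. 95% confidence)
beta = 0.8 # target power: correctly reject the null hypothesis 80% of the time (conventionally written 1 - beta)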
The first step is to use a simulation.
A single experiment is to toss the coin N times and reject the null hypothesis if the proportion of heads deviates "too far" from the expected value of 1/2.
We then repeat the experiment M times and count how many times the null hypothesis is (correctly) rejected.
M <- 1000
simulate.power <- function(delta, N, p.null, M, alpha) {
print(paste("Calculating power for N =", N))
reject <- c()
se <- sqrt(p.null*(1-p.null))/sqrt(N)
for (i in (1:M)) {
heads <- biased.coin(delta, N) # perform an experiment
z <- (heads - p.null)/se # z-score
p.value <- pnorm(-abs(z)) # p-value
reject[i] <- p.value < alpha/2 # Do we reject the null? (one tail compared against alpha/2 = two-sided test)
}
return(sum(reject)/M) # proportion of time null was rejected.
}
Next we plot a graph (slow, about 5 minutes):
ns <- seq(1000, 50000, by=1000)
my.pwr <- c()
for (i in (1:length(ns))) {
my.pwr[i] <- simulate.power(delta, ns[i], p.null, M, alpha)
}
plot(ns, my.pwr)
From the graph it looks like the N you need for a power of beta = 0.8 is about 20000.
The simulation is very slow so it would be nice to have a built in function.
A little fiddling around gave me this:
magic <- function(p.null, delta, alpha, N) {
magic <-power.prop.test(p1=p.null,
p2=p.null+delta,
sig.level=alpha,
###################################
n=2*N, # mysterious 2
###################################
alternative="two.sided",
strict=FALSE)
return(magic[["power"]])
}
Let's plot it against our simulated data.
pwr.magic <- c()
for (i in (1:length(ns))) {
pwr.magic[i] <- magic(p.null, delta, alpha, ns[i])
}
points(ns, pwr.magic, pch=20)
The fit is good, but I have no idea why I would need to multiply N by two,
in order to get a one sample power out of a two sample proportion test.
It would be nice if there were a built in function that let you do one sample directly.
Thanks!

You could try
library(pwr)
h <- ES.h(0.51, 0.5) # Compute effect size h for two proportions
pwr.p.test(h = h, n = NULL, sig.level = 0.05, power = 0.8, alternative = "two.sided")
# proportion power calculation for binomial distribution (arcsine transformation)
# h = 0.02000133
# n = 19619.53
# sig.level = 0.05
# power = 0.8
# alternative = two.sided
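On the "mysterious 2": under the normal approximation, the one-sample statistic has variance p(1-p)/N, while the difference of two sample proportions with n observations per group has variance of roughly 2p(1-p)/n. A two-sample test with n = 2N per group therefore behaves like a one-sample test with N. Below is a minimal sketch of the one-sample power under that approximation; power_one_prop is a hypothetical helper name, not a built-in function, and the comparison with power.prop.test is only there to illustrate the factor of two.

```r
# Normal-approximation power of a two-sided one-sample test of a proportion.
# power_one_prop is a hypothetical helper, not part of base R or pwr.
power_one_prop <- function(p.null, p.alt, N, alpha = 0.05) {
  z.crit  <- qnorm(1 - alpha / 2)            # two-sided critical value
  sd.null <- sqrt(p.null * (1 - p.null))     # per-toss sd under H0
  sd.alt  <- sqrt(p.alt * (1 - p.alt))       # per-toss sd under the alternative
  shift   <- (p.alt - p.null) * sqrt(N)      # scaled distance between hypotheses
  upper <- pnorm(( shift - z.crit * sd.null) / sd.alt)  # reject in upper tail
  lower <- pnorm((-shift - z.crit * sd.null) / sd.alt)  # reject in lower tail
  upper + lower
}

one.sample <- power_one_prop(0.5, 0.51, N = 20000)
# Two-sample power with n = 2N per group, for comparison
two.sample <- power.prop.test(n = 2 * 20000, p1 = 0.5, p2 = 0.51,
                              sig.level = 0.05)$power
c(one.sample = one.sample, two.sample = two.sample)
```

The two numbers should agree closely for small delta, which is consistent with the fit seen in the plot.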
As an aside, one way to speed up your simulation significantly would be to use rbinom instead of runif:
biased.coin2 <- function(delta, N) {
rbinom(1, N, 0.5 + delta) / N
}
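Taking the rbinom idea one step further, the whole loop over M experiments can be replaced by a single vectorised draw. This is only a sketch under the same assumptions as the simulation in the question (true probability p.null + delta, z-test against p.null); simulate_power_fast is a hypothetical name.

```r
# Vectorised version of the power simulation: one rbinom() call draws the
# number of heads for all M experiments at once.
# simulate_power_fast is a hypothetical helper name.
simulate_power_fast <- function(delta, N, p.null = 0.5, M = 1000, alpha = 0.05) {
  heads <- rbinom(M, size = N, prob = p.null + delta) / N  # proportion of heads
  se <- sqrt(p.null * (1 - p.null) / N)                    # standard error under H0
  z <- (heads - p.null) / se                               # z-scores
  p.value <- 2 * pnorm(-abs(z))                            # two-sided p-values
  mean(p.value < alpha)                                    # proportion rejected
}

set.seed(1)
pw <- simulate_power_fast(delta = 0.01, N = 20000, M = 20000)
pw
```

With this version the whole curve over ns should take seconds rather than minutes.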

Related

Calculate expected value of variance using monte carlo simulation

So I have this probability distribution
X = { 0                               with probability 7/8
    { an Exp(rate = 1/60) draw        with probability 1/8
James's car breaks down N times a year, where N ~ Pois(2), X is the repair cost of a single breakdown, and Y is the total cost incurred by James in a year.
I want to calculate E[Y] and V(Y), which should give me E[Y] = 15 and V(Y) = 1800.
I have this monte Carlo simulation:
expon_dis <- rexp(200, 1/60)
result_matrix2 <- rep(0, 200)
expected_matrix <- rep(0, runs)
for (u in 1:runs){
expon_dis <- rexp(200, 1/60)
N <- rpois(200, 2)
for (l in 1:200){
result_matrix2[l] <- (expon_dis[l] * (1/8)) * (N[l])
}
expected_matrix[u] <- mean(result_matrix2)
}
This code gives the expected value of 15 but the variance is not correct. So what is wrong with this simulation?
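I haven't had time to read through your code in detail, but I think the error comes from the multiplication: scaling every cost by 1/8 reproduces the mean but not the variance of a cost that is zero 7/8 of the time.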
Below is a very rough implementation, where first you write a function to simulate the cost, given x number of breakdowns:
sim_cost = function(x){
cost = rexp(x,1/60)
prob = sample(c(0,1/60),x,prob=c(7/8,1/8),replace=TRUE)
sum(cost[prob>0])
}
Then generate the number of breakdowns per year:
set.seed(111)
N <- rpois(500000, 2)
Iterate over the years, if 0, we return 0:
set.seed(111)
sim = sapply(N,function(i)if(i==0){0}else{sum(sim_cost(i))})
mean(sim)
[1] 14.98248
var(sim)
[1] 1797.549
You need quite a number of simulations, but above should be a code that you can start to optimize to get it closer.
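As a cross-check, the targets follow from the compound Poisson identities E[Y] = lambda * E[X] and Var(Y) = lambda * E[X^2]. The sketch below computes them and runs a vectorised simulation. It assumes the reading that a breakdown costs nothing with probability 7/8 and an Exp(rate = 1/60) amount with probability 1/8; all variable names are illustrative.

```r
# Assumed model: N ~ Pois(2) breakdowns per year; each breakdown costs 0 with
# probability 7/8 and an Exp(rate = 1/60) amount with probability 1/8.
lambda <- 2; p.cost <- 1/8; rate <- 1/60

# Theoretical moments of the compound Poisson total Y
EX  <- p.cost / rate             # E[X]   = (1/8) * 60
EX2 <- p.cost * 2 / rate^2       # E[X^2] = (1/8) * 2 * 60^2
EY  <- lambda * EX               # 15
VY  <- lambda * EX2              # 1800

# Vectorised simulation
set.seed(1)
n.years <- 200000
N <- rpois(n.years, lambda)              # breakdowns per year
K <- rbinom(n.years, N, p.cost)          # breakdowns that actually cost money
Y <- numeric(n.years)                    # years with K = 0 keep total cost 0
pos <- K > 0
# The sum of K iid exponentials is Gamma(shape = K, rate)
Y[pos] <- rgamma(sum(pos), shape = K[pos], rate = rate)

c(mean = mean(Y), var = var(Y), EY = EY, VY = VY)
```

Drawing the yearly total directly from a Gamma distribution avoids the per-year sapply loop, so many more simulated years are affordable.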

Maximum likelihood estimation of a multivariate normal distribution of arbitrary dimension in R - THE ULTIMATE GUIDE?

Searching through Stack Overflow for similar questions, I notice that this has been asked several times but hasn't really been properly answered. Perhaps with help from other users this post can be a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.
I know, I know! The closed form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose and I don't expect an exact analytic solution so this is a test case to check the procedure.
So here is my attempt. Please comment. Especially if I am missing opportunities for optimization. Note, I'm not a statistician so I'd appreciate any pointers.
ll_multN <- function(theta,X) {
# theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
# X is an nxk dataset
# log-likelihood: L = - (nk/2)*log(2*pi) - (n/2)*log(det(Sigma)) - (1/2)*sum_i(t(X_i-mu) %*% Sigma^-1 %*% (X_i-mu))
# summation over i is performed using an apply call for efficiency
n <- nrow(X)
k <- ncol(X)
# def mu
mu.vec <- theta[1:k]
# def Sigma
Sigma.diag <- theta[(k+1):(2*k)]
Sigma.offd <- theta[(2*k+1):length(theta)]
Sigma <- matrix(NA, k, k)
Sigma[upper.tri(Sigma)] <- Sigma.offd
Sigma <- t(Sigma)
Sigma[upper.tri(Sigma)] <- Sigma.offd
diag(Sigma) <- Sigma.diag
# compute summation
sum_i <- sum(apply(X, 1, function(x) (matrix(x,1,k)-mu.vec)%*%solve(Sigma)%*%t(matrix(x,1,k)-mu.vec)))
# compute log likelihood
logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma))
logl <- logl - .5*sum_i
return(-logl)
}
Simulated dataset generated using the rmvnorm() function in the package "mvtnorm". Random positive definite covariance matrix generated using the additional function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708)
library(mvtnorm)
Posdef <- function (n, ev = runif(n, 0, 5)) {
# generates a random positive definite covariance matrix
Z <- matrix(ncol=n, rnorm(n^2))
decomp <- qr(Z)
Q <- qr.Q(decomp)
R <- qr.R(decomp)
d <- diag(R)
ph <- d / abs(d)
O <- Q %*% diag(ph)
Z <- t(O) %*% diag(ev) %*% O
return(Z)
}
set.seed(2)
n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=TRUE) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$values # check positive def (all lambda > 0)
# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)
# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))
# limits for optimization algorithm
eps <- .Machine$double.eps # a small value for bounding the parameter space to avoid things such as log(0).
n.offd <- k*(k-1)/2 # number of off-diagonal (covariance) parameters; equals k only when k = 3
lower.bound <- c(rep(-Inf,k), # bound on mu
rep(eps,k), # bound on sigma_ii
rep(-Inf,n.offd)) # bound on sigma_ij i=/=j
upper.bound <- c(rep(Inf,k), # bound on mu
rep(100,k), # bound on sigma_ii
rep(100,n.offd)) # bound on sigma_ij i=/=j
system.time(
o <- optim(pars.init,
ll_multN, X=X, method="L-BFGS-B",
lower = lower.bound,
upper = upper.bound)
)
plot(x=c(mu.tru,diag(Sigma.tru),Sigma.tru[upper.tri(Sigma.tru)]),
y=o$par,
xlab="Parameter",
ylab="Estimate",
pch=20)
abline(c(0,1), col="red", lty=2)
This currently runs on my laptop in
user system elapsed
47.852 24.014 24.611
and gives this graphical output (figure not shown: estimated mean and variance plotted against the true parameter values).
In particular any advice on limit setting or algorithm choice would be much appreciated.
Thanks
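One likely optimization: the apply call above inverts Sigma once per row of X. A possible sketch that factors Sigma only once per likelihood evaluation uses stats::mahalanobis for the quadratic forms and determinant() for the log-determinant. ll_multN_fast is a hypothetical name; it assumes the same theta layout as ll_multN and k >= 2.

```r
# Negative log-likelihood of a multivariate normal, same theta layout as
# ll_multN: c(mu, diag(Sigma), Sigma[upper.tri(Sigma)]). Assumes k >= 2.
# ll_multN_fast is a hypothetical name for this sketch.
ll_multN_fast <- function(theta, X) {
  n <- nrow(X)
  k <- ncol(X)
  mu <- theta[1:k]
  # rebuild the symmetric covariance matrix
  Sigma <- matrix(0, k, k)
  Sigma[upper.tri(Sigma)] <- theta[(2 * k + 1):length(theta)]
  Sigma <- Sigma + t(Sigma)
  diag(Sigma) <- theta[(k + 1):(2 * k)]
  # log-determinant computed on the log scale to avoid under/overflow
  log.det <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  # all n quadratic forms (x_i - mu)' Sigma^-1 (x_i - mu) in one call
  quad <- sum(mahalanobis(X, center = mu, cov = Sigma))
  0.5 * (n * k * log(2 * pi) + n * log.det + quad)
}

# Small check against a direct row-by-row computation
set.seed(1)
X.chk <- matrix(rnorm(60), 20, 3)
theta.chk <- c(0.1, 0.2, 0.3, 1, 2, 3, 0.1, 0.2, 0.3)
S.chk <- matrix(c(1, 0.1, 0.2, 0.1, 2, 0.3, 0.2, 0.3, 3), 3, 3)
direct <- 0.5 * (20 * 3 * log(2 * pi) + 20 * log(det(S.chk)) +
  sum(apply(X.chk, 1, function(x) t(x - theta.chk[1:3]) %*% solve(S.chk) %*% (x - theta.chk[1:3]))))
fast <- ll_multN_fast(theta.chk, X.chk)
c(direct = direct, fast = fast)
```

It can be passed to optim in place of ll_multN. Note that box constraints alone do not guarantee a positive definite Sigma, so the optimizer may still step into invalid regions.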

Rejection Sampling to generate Normal samples from Cauchy samples

I tried my luck at coding a rejection sampling method to generate a sample that follows a normal distribution. The samples look normally distributed at first glance, but the p-value of the Shapiro-Wilk test is always <0.05. I don't really know where I went wrong, and I only got the pseudo-code from my teacher (it's NOT homework). Any help is appreciated. Below is my code:
f <- function(x,m,v) { #target distribution, m=mean,v=variance
dnorm(x,m,sqrt(v))
}
g <- function(x,x0,lambda) { #cauchy distribution for sampling
dcauchy(x,x0,lambda)
}
genSamp <- function(n,m,v) { #I want the user to be able to choose mean and variance
#and size of the sample
stProbe <- rep(0,n) #the sample vector
interval = c(m-10*sqrt(v),m+10*sqrt(v)) #wanted to be sure that everything
#is covered, so I took a range
#that depends on the mean
M = max(f(interval,m,v)/g(interval,m,v)) #rescaling coefficient, so the cauchy distribution
#is never under the normal distribution
#I chose x0 = m and lambda = v, so the cauchy distribution is close to
#the target normal distribution
for (i in 1:n) {
repeat{
x <- rcauchy(1,m,v)
u <- runif(1,0,max(f(interval,m,v)))
if(u < (f(x,m,v)/(M*g(x,m,v)))) {
break
}
}
stProbe[i] <- x
}
return(stProbe)
}
Then I tried it out with:
test <- genSamp(100,2,0.5)
hist(test,prob=T,breaks=30)#looked not bad
shapiro.test(test) #p-value way below 0.05
Thank you in advance for your help.
Actually, the first thing I checked is sample mean and sample variance. When I draw 1000 samples with your genSamp, I get sample mean at 2, but sample variance at about 2.64, far from the target 0.5.
The 1st problem is with your computation of M. Note that:
interval = c(m - 10 * sqrt(v), m + 10 * sqrt(v))
only gives you 2 values, rather than a grid of equally spaced points on the interval. At 10 standard deviations away from the mean, the Normal density is almost 0, so M is almost 0. You need to do something like
interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
The 2nd problem is the generation of uniform random variable in your repeat. Why do you do
u <- runif(1,0,max(f(interval,m,v)))
You want
u <- runif(1, 0, 1)
With these fixes, I have tested that genSamp gets the correct sample mean and sample variance. The samples pass both Shapiro–Wilk test and Kolmogorov-Smirnov test (?ks.test).
Full working code
f <- function(x,m,v) dnorm(x,m,sqrt(v))
g <- function(x,x0,lambda) dcauchy(x,x0,lambda)
genSamp <- function(n,m,v) {
stProbe <- rep(0,n)
interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
M = max(f(interval,m,v)/g(interval,m,v))
for (i in 1:n) {
repeat{
x <- rcauchy(1,m,v)
u <- runif(1,0,1)
if(u < (f(x,m,v)/(M*g(x,m,v)))) break
}
stProbe[i] <- x
}
return(stProbe)
}
set.seed(0)
test <- genSamp(1000, 2, 0.5)
shapiro.test(test)$p.value
#[1] 0.1563038
ks.test(test, rnorm(1000, 2, sqrt(0.5)))$p.value
#[1] 0.7590978
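A possible vectorised variant is sketched below. Note that it departs from the code above in two ways, both assumptions of this sketch: the Cauchy proposal uses scale sqrt(v) rather than v, and for that choice the envelope constant is known in closed form, M = sqrt(2*pi/e) (the ratio f/g peaks one standard deviation from the mean), so no grid search is needed. rnorm_reject is a hypothetical name.

```r
# Vectorised rejection sampler: Normal(m, v) target, Cauchy(m, sqrt(v)) proposal.
# rnorm_reject is a hypothetical name; the proposal scale sqrt(v) is an
# assumption of this sketch (the code above uses scale v).
rnorm_reject <- function(n, m, v) {
  s <- sqrt(v)
  M <- sqrt(2 * pi / exp(1))   # sup of dnorm/dcauchy when both use scale s
  out <- numeric(0)
  while (length(out) < n) {
    x <- rcauchy(2 * n, location = m, scale = s)   # batch of proposals
    u <- runif(2 * n)                              # uniforms on (0, 1)
    keep <- u < dnorm(x, m, s) / (M * dcauchy(x, m, s))
    out <- c(out, x[keep])                         # accepted draws
  }
  out[seq_len(n)]
}

set.seed(42)
z <- rnorm_reject(5000, m = 2, v = 0.5)
c(mean = mean(z), var = var(z))
```

The acceptance rate is 1/M, about 66%, so two or three batches are usually enough.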
You have
f <- function(x,m,v) { #target distribution, m=mean,v=variance
dnorm(x,e,sqrt(v))
}
which evaluates the density with mean e, but e is never defined.

Confidence Interval (CI) simulation in R: How?

I was wondering how I could check via simulation in R that the 95% Confidence Interval obtained from a binomial test with 5 successes in 15 trials when TRUE p = .5 has a 95% "Coverage Probability" in the long-run?
Here is the 95% CI for such a test using R (how can I show that the following CI has a 95% coverage in the long-run if TRUE p = .5):
as.numeric(binom.test(x = 5, n = 15, p = .5)[[4]])
# > [1] 0.1182411 0.6161963 (in the long-run 95% of the time, ".5" is contained within these
# two numbers, how to show this in R?)
Something like this?
fun <- function(n = 15, p = 0.5){
x <- rbinom(1, size = n, prob = p)
res <- binom.test(x, n, p)[[4]]
c(Lower = res[1], Upper = res[2])
}
set.seed(3183)
R <- 10000
sim <- t(replicate(R, fun()))
Note that binom.test when called with 5 successes, 15 trials and p = 0.5 will always return the same value, hence the call to rbinom. The number of successes will vary. We can compute the proportion of cases when p is between Lower and Upper.
cov <- mean(sim[,1] <= .5 & .5 <= sim[,2])
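Since n is small, the simulated coverage can also be compared with the exact value: enumerate every possible number of successes, check whether its interval contains p, and add up the binomial probabilities of those outcomes. A minimal sketch (variable names are illustrative):

```r
# Exact coverage of the binom.test (Clopper-Pearson) interval for n = 15, p = 0.5
n <- 15
p <- 0.5
x <- 0:n                                   # every possible number of successes
covers <- sapply(x, function(k) {
  ci <- binom.test(k, n)$conf.int          # 95% CI for k successes in n trials
  ci[1] <= p && p <= ci[2]                 # does the interval contain the true p?
})
exact.cov <- sum(dbinom(x, n, p)[covers])  # probability of drawing a covering x
exact.cov
```

The exact interval is conservative, so the coverage should come out at or above 95% rather than exactly 95%, and the simulated proportion should be close to this value.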

How to implement Aly's permutation test for comparison of variances in R?

The excerpt below is from "Permutation, Parametric and Bootstrap Tests of Hypotheses", Third Ed. by Phillip Good (pages 58-61), section 3.7.2.
I am trying to implement this permutation test in R (see further below) to compare two variances. I am now thinking about how to calculate the p-value, and whether the test allows for different alternative hypotheses (greater, less, two-sided), and I am not sure how to proceed.
Could you shed some light on this and perhaps give me some criticism about the code? Many thanks!
# Aly's non-parametric, permutation test of equality of variances
# From "Permutation, Parametric and Bootstrap Tests of Hypotheses", Third Ed.
# by Phillip Good (pages 58-61), section 3.7.2.
# Implementation of delta statistic as defined by formula in page 60
# x_{i}, order statistics
# z = x_{i+1} - x_{i}, differences between successive order statistics
aly_delta_statistic <- function(z) {
z_length <- length(z)
m <- z_length + 1
i <- 1:z_length
sum(i*(m-i)*z)
}
aly_test_statistic <- function(sample1, sample2 = NULL, nperm = 1) {
# compute statistic based on one sample only: sample1
if(is.null(sample2)) {
sample1 <- sort(sample1)
z <- diff(sample1)
return(aly_delta_statistic(z))
}
# statistic based on randomization of the two samples
else {
m1 <- length(sample1)
m2 <- length(sample2)
# allocate a vector to save the statistic delta
statistic <- vector(mode = "numeric", length = nperm)
for(j in 1:nperm) {
# 1st stage resampling (performed only if sample sizes are different):
# the larger sample is randomly reduced to the size of the smaller.
# Work on copies (s1, s2) so every iteration starts from the original
# samples, and sort both so diff() yields order-statistic spacings.
m <- min(m1, m2)
s1 <- sort(if(m1 > m) sample(sample1, m) else sample1)
s2 <- sort(if(m2 > m) sample(sample2, m) else sample2)
# z-values: z1 in column 1 and z2 in column 2.
z_two_samples <- matrix(c(diff(s1), diff(s2)), ncol = 2)
# 2nd stage resampling
z <- apply(z_two_samples, 1, sample, 1)
statistic[j] <- aly_delta_statistic(z)
}
return(statistic)
}
}
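On the p-value: for a permutation test in general, it is the proportion of permutation statistics at least as extreme as the observed one. The sketch below is generic and not taken from Good's book. perm_p_value is a hypothetical helper, and which value to use as the observed statistic (for example, delta computed on one of the original samples) is an assumption left to the reader.

```r
# Generic permutation p-value from an observed statistic and a vector of
# permutation statistics. perm_p_value is a hypothetical helper; the "+ 1"
# counts the observed statistic as one of the permutations.
perm_p_value <- function(observed, perm.stats,
                         alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  B <- length(perm.stats)
  p.greater <- (sum(perm.stats >= observed) + 1) / (B + 1)
  p.less    <- (sum(perm.stats <= observed) + 1) / (B + 1)
  switch(alternative,
         greater   = p.greater,
         less      = p.less,
         two.sided = min(1, 2 * min(p.greater, p.less)))
}

# Toy usage with made-up numbers
perm_p_value(10, 1:9, "greater")
perm_p_value(10, 1:9, "two.sided")
```

With the code above, perm.stats would be the vector returned by aly_test_statistic(sample1, sample2, nperm).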
