Rejection sampling to generate Normal samples from Cauchy samples in R

I tried my luck at coding a rejection sampling method to generate a sample that follows a normal distribution. The samples look like normal distributions at first glance, but the p-value of the Shapiro-Wilk test is always < 0.05. I don't really know where I went wrong; I only got the pseudo-code from my teacher (it's NOT homework). Any help is appreciated. Below is my code:
f <- function(x, m, v) { # target distribution, m = mean, v = variance
  dnorm(x, m, sqrt(v))
}
g <- function(x, x0, lambda) { # Cauchy distribution for sampling
  dcauchy(x, x0, lambda)
}
genSamp <- function(n, m, v) { # I want the user to be able to choose mean, variance
                               # and size of the sample
  stProbe <- rep(0, n) # the sample vector
  interval = c(m - 10*sqrt(v), m + 10*sqrt(v)) # wanted to make sure that everything
                                               # is covered, so I took a range
                                               # that depends on the mean
  M = max(f(interval, m, v) / g(interval, m, v)) # rescaling coefficient, so the scaled Cauchy
                                                 # density is never below the normal density
  # I chose x0 = m and lambda = v, so the Cauchy distribution is close to
  # the target normal distribution
  for (i in 1:n) {
    repeat {
      x <- rcauchy(1, m, v)
      u <- runif(1, 0, max(f(interval, m, v)))
      if (u < (f(x, m, v) / (M*g(x, m, v)))) {
        break
      }
    }
    stProbe[i] <- x
  }
  return(stProbe)
}
Then I tried it out with:
test <- genSamp(100, 2, 0.5)
hist(test, prob = TRUE, breaks = 30) # looked not bad
shapiro.test(test) # p-value way below 0.05
Thank you in advance for your help.

Actually, the first thing I checked was the sample mean and sample variance. When I draw 1000 samples with your genSamp, I get a sample mean of about 2 but a sample variance of about 2.64, far from the target 0.5.
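For example, a check along those lines (just a sketch; with the original code the sample variance varies wildly from run to run):
s <- genSamp(1000, 2, 0.5)  # using the original genSamp above
mean(s)  # close to the target mean of 2
var(s)   # far from the target variance of 0.5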
The 1st problem is with your computation of M. Note that:
interval = c(m - 10 * sqrt(v), m + 10 * sqrt(v))
only gives you 2 values, rather than a grid of equally spaced points on the interval. At 10 standard deviations away from the mean, the Normal density is almost 0, so M is almost 0. You need to do something like
interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
The 2nd problem is the generation of the uniform random variable inside your repeat. Why do you do
u <- runif(1,0,max(f(interval,m,v)))
You want
u <- runif(1, 0, 1)
With these fixes, I have verified that genSamp produces the correct sample mean and sample variance. The samples pass both the Shapiro-Wilk test and the Kolmogorov-Smirnov test (?ks.test).
Full working code
f <- function(x, m, v) dnorm(x, m, sqrt(v))
g <- function(x, x0, lambda) dcauchy(x, x0, lambda)
genSamp <- function(n, m, v) {
  stProbe <- rep(0, n)
  interval <- seq(m - 10 * sqrt(v), m + 10 * sqrt(v), by = 0.01)
  M <- max(f(interval, m, v) / g(interval, m, v))
  for (i in 1:n) {
    repeat {
      x <- rcauchy(1, m, v)
      u <- runif(1, 0, 1)
      if (u < (f(x, m, v) / (M * g(x, m, v)))) break
    }
    stProbe[i] <- x
  }
  return(stProbe)
}
set.seed(0)
test <- genSamp(1000, 2, 0.5)
shapiro.test(test)$p.value
#[1] 0.1563038
ks.test(test, rnorm(1000, 2, sqrt(0.5)))$p.value
#[1] 0.7590978
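As a further quick check of the first two moments (the values will differ slightly between runs):
mean(test)  # should be close to 2
var(test)   # should be close to 0.5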

You have
f <- function(x,m,v) { #target distribution, m=mean,v=variance
dnorm(x,e,sqrt(v))
}
which evaluates the normal density with mean e, but e is never defined.

Related

Calculate expected value and variance using Monte Carlo simulation

So I have this probability distribution for the repair cost of a single breakdown:
X = 0                          with probability 7/8
X ~ Exponential(rate = 1/60)   with probability 1/8
James's car breaks down N times a year, where N ~ Pois(2); X is the repair cost of a breakdown and Y is the total cost James causes in a year.
I want to calculate E[Y] and V(Y), which should give me E[Y] = 15 and V(Y) = 1800.
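For reference, these targets follow from the compound Poisson formulas E[Y] = E[N] * E[X] and Var(Y) = E[N] * E[X^2], with E[X] = (1/8) * 60 = 7.5 and E[X^2] = (1/8) * 2 * 60^2 = 900:
2 * 7.5   # E[Y]   = 15
2 * 900   # Var(Y) = 1800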
I have this Monte Carlo simulation:
expon_dis <- rexp(200, 1/60)
result_matrix2 <- rep(0, 200)
expected_matrix <- rep(0, runs)
for (u in 1:runs) {
  expon_dis <- rexp(200, 1/60)
  N <- rpois(200, 2)
  for (l in 1:200) {
    result_matrix2[l] <- (expon_dis[l] * (1/8)) * (N[l])
  }
  expected_matrix[u] <- mean(result_matrix2)
}
This code gives the expected value of 15 but the variance is not correct. So what is wrong with this simulation?
Not enough time to read through your code, but I think the error comes from the multiplication.
Below is a very rough implementation, where first you write a function to simulate the cost, given x number of breakdowns:
sim_cost = function(x) {
  cost = rexp(x, 1/60)                                              # potential cost of each breakdown
  prob = sample(c(0, 1/60), x, prob = c(7/8, 1/8), replace = TRUE)  # which breakdowns actually cost anything
  sum(cost[prob > 0])
}
Then generate the number of breakdowns per year:
set.seed(111)
N <- rpois(500000, 2)
Iterate over the years; if there are no breakdowns, the cost is 0:
set.seed(111)
sim = sapply(N, function(i) if (i == 0) {0} else {sum(sim_cost(i))})
mean(sim)
[1] 14.98248
var(sim)
[1] 1797.549
You need quite a large number of simulations, but the above should give you code that you can start optimizing to get it closer.
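One possible way to vectorise it further (my own sketch, not part of the original answer; sim2, n.years and k are new names): per year the number of breakdowns with a non-zero cost is Binomial(N, 1/8), and the sum of k independent Exp(rate = 1/60) costs is Gamma(shape = k, rate = 1/60), so the loop over years can be avoided entirely:
set.seed(111)
n.years <- 500000
N <- rpois(n.years, 2)                      # breakdowns per year
k <- rbinom(n.years, size = N, prob = 1/8)  # breakdowns that actually cost something
sim2 <- numeric(n.years)                    # yearly totals, 0 when k == 0
sim2[k > 0] <- rgamma(sum(k > 0), shape = k[k > 0], rate = 1/60)
mean(sim2)  # should be close to 15
var(sim2)   # should be close to 1800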

Maximum likelihood estimation of a multivariate normal distribution of arbitrary dimension in R - THE ULTIMATE GUIDE?

Searching through Stack Overflow for similar questions, I notice that this has been asked several times but hasn't really been properly answered. Perhaps with help from other users this post can become a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.
I know, I know! The closed-form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose and I don't expect an exact analytic solution, so this is a test case to check the procedure.
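For completeness, a sketch of those closed-form MLEs (the sample mean, and the sample covariance with an n rather than n - 1 denominator), using the n x k data matrix X that is generated further below:
mu.hat    <- colMeans(X)
Sigma.hat <- cov(X) * (nrow(X) - 1) / nrow(X)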
So here is my attempt. Please comment. Especially if I am missing opportunities for optimization. Note, I'm not a statistician so I'd appreciate any pointers.
ll_multN <- function(theta, X) {
  # theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
  # X is an n x k dataset
  # log-likelihood:
  # logL = -(nk/2)*log(2*pi) - (n/2)*log(det(Sigma)) - (1/2)*sum_i t(X_i - mu) %*% Sigma^-1 %*% (X_i - mu)
  # the summation over i is performed using an apply() call for efficiency
  n <- nrow(X)
  k <- ncol(X)
  # define mu
  mu.vec <- theta[1:k]
  # define Sigma (symmetric, built from its diagonal and upper triangle)
  Sigma.diag <- theta[(k+1):(2*k)]
  Sigma.offd <- theta[(2*k+1):length(theta)]
  Sigma <- matrix(NA, k, k)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  Sigma <- t(Sigma)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  diag(Sigma) <- Sigma.diag
  # compute the summation
  sum_i <- sum(apply(X, 1, function(x) (matrix(x, 1, k) - mu.vec) %*% solve(Sigma) %*% t(matrix(x, 1, k) - mu.vec)))
  # compute the log-likelihood
  logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma))
  logl <- logl - .5*sum_i
  return(-logl) # negative log-likelihood, for minimization
}
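One possible micro-optimisation (a sketch, not part of the original code): the row-wise apply() call above evaluates solve(Sigma) once per row, and it can be replaced by a single mahalanobis() call that computes (x - mu)' Sigma^-1 (x - mu) for all rows at once:
# drop-in replacement for the sum_i line above
sum_i <- sum(mahalanobis(X, center = mu.vec, cov = Sigma))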
The simulated dataset is generated using the rmvnorm() function in the package "mvtnorm". The random positive definite covariance matrix is generated using the additional function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708)
library(mvtnorm)
Posdef <- function(n, ev = runif(n, 0, 5)) {
  # generates a random positive definite covariance matrix
  Z <- matrix(ncol = n, rnorm(n^2))
  decomp <- qr(Z)
  Q <- qr.Q(decomp)
  R <- qr.R(decomp)
  d <- diag(R)
  ph <- d / abs(d)
  O <- Q %*% diag(ph)
  Z <- t(O) %*% diag(ev) %*% O
  return(Z)
}
set.seed(2)
n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=T) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$val # check positive def (all lambda > 0)
# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)
# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))
# limits for optimization algorithm
eps <- .Machine$double.eps # a small value for bounding the parameter space, to avoid things such as log(0)
lower.bound <- c(rep(-Inf, k), # bound on mu
                 rep(eps, k),  # bound on sigma_ii
                 rep(-Inf, k)) # bound on sigma_ij, i != j
upper.bound <- c(rep(Inf, k),  # bound on mu
                 rep(100, k),  # bound on sigma_ii
                 rep(100, k))  # bound on sigma_ij, i != j
system.time(
  o <- optim(pars.init,
             ll_multN, X = X, method = "L-BFGS-B",
             lower = lower.bound,
             upper = upper.bound)
)
plot(x = c(mu.tru, diag(Sigma.tru), Sigma.tru[upper.tri(Sigma.tru)]),
     y = o$par,
     xlab = "Parameter",
     ylab = "Estimate",
     pch = 20)
abline(c(0, 1), col = "red", lty = 2)
This currently runs on my laptop in
user system elapsed
47.852 24.014 24.611
and gives this graphical output:
[plot: estimated parameter values vs. true values]
In particular any advice on limit setting or algorithm choice would be much appreciated.
Thanks

One-sample hypothesis test for proportions

I'm looking for a built-in R function that calculates the power of a one sample hypothesis test for proportions.
The built-in function power.prop.test only does TWO-SAMPLE hypothesis tests for proportions.
The original question is: "How many times do you have to toss a coin to determine that it is biased?"
p.null <- 0.5 # null hypothesis.
We say that a coin is "biased" if the probability of tossing heads is either
greater than 0.51 or less than 0.49. Otherwise we say that it is "good enough"
delta <- 0.01
Here is a function to toss a biased coin N times and return the proportion of heads:
biased.coin <- function(delta, N) {
  probs <- runif(N, 0, 1)
  heads <- probs[probs < 0.5 + delta]
  return(length(heads) / N)
}
We fix alpha and beta throughout at the standard values. Our goal is to calculate N.
alpha = 0.05 # significance level (95% confidence)
beta = 0.8   # power: correctly reject the null hypothesis 80% of the time
The first step is to use a simulation.
A single experiment is to toss the coin N times and reject the null hypothesis if the number of heads deviates "too far" from the expected value of N/2
We then repeat the experiment M times and count how many times the null hypothesis is (correctly) rejected.
M <- 1000
simulate.power <- function(delta, N, p.null, M, alpha) {
  print(paste("Calculating power for N =", N))
  reject <- c()
  se <- sqrt(p.null * (1 - p.null)) / sqrt(N)
  for (i in (1:M)) {
    heads <- biased.coin(delta, N)    # perform an experiment
    z <- (heads - p.null) / se        # z-score
    p.value <- pnorm(-abs(z))         # one-sided p-value
    reject[i] <- p.value < alpha / 2  # do we reject the null?
  }
  return(sum(reject) / M) # proportion of the time the null was rejected
}
Next we plot a graph (slow, about 5 minutes):
ns <- seq(1000, 50000, by=1000)
my.pwr <- c()
for (i in (1:length(ns))) {
  my.pwr[i] <- simulate.power(delta, ns[i], p.null, M, alpha)
}
plot(ns, my.pwr)
From the graph it looks like the N you need for a power of beta = 0.8 is about 20000.
The simulation is very slow so it would be nice to have a built in function.
A little fiddling around gave me this:
magic <- function(p.null, delta, alpha, N) {
  magic <- power.prop.test(p1 = p.null,
                           p2 = p.null + delta,
                           sig.level = alpha,
                           ###################################
                           n = 2 * N, # mysterious 2
                           ###################################
                           alternative = "two.sided",
                           strict = FALSE)
  return(magic[["power"]])
}
Let's plot it against our simulated data.
pwr.magic <- c()
for (i in (1:length(ns))) {
  pwr.magic[i] <- magic(p.null, delta, alpha, ns[i])
}
points(ns, pwr.magic, pch=20)
The fit is good, but I have no idea why I would need to multiply N by two in order to get a one-sample power out of a two-sample proportion test.
It would be nice if there were a built in function that let you do one sample directly.
Thanks!
You could try
library(pwr)
h <- ES.h(0.51, 0.5) # Compute effect size h for two proportions
pwr.p.test(h = h, n = NULL, sig.level = 0.05, power = 0.8, alternative = "two.sided")
# proportion power calculation for binomial distribution (arcsine transformation)
# h = 0.02000133
# n = 19619.53
# sig.level = 0.05
# power = 0.8
# alternative = two.sided
As an aside, one way to speed up your simulation significantly would be to use rbinom instead of runif:
biased.coin2 <- function(delta, N) {
  rbinom(1, N, 0.5 + delta) / N
}
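For example (a sketch that reuses the names defined in the question), it can be dropped straight into the simulation:
biased.coin <- biased.coin2                     # drop-in replacement
simulate.power(delta, 20000, p.null, M, alpha)  # should come out near 0.8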

How to implement Aly's permutation test for comparison of variances in R?

This is from section 3.7.2 (pages 58-61) of "Permutation, Parametric and Bootstrap Tests of Hypotheses", Third Ed., by Phillip Good.
I am trying to implement this permutation test in R (see below) to compare two variances. I am now thinking about how to calculate the p-value, and whether the test allows for different alternative hypotheses (greater, less, two-sided), and I am not sure how to proceed.
Could you shed some light on this and perhaps give me some criticism about the code? Many thanks!
# Aly's non-parametric, permutation test of equality of variances
# From "Permutation, Parametric and Bootstrap Tests of Hypotheses", Third Ed.
# by Phillip Good (pages 58-61), section 3.7.2.
# Implementation of delta statistic as defined by formula in page 60
# x_{i}, order statistics
# z = x_{i+1} - x_{i}, differences between successive order statistics
aly_delta_statistic <- function(z) {
  z_length <- length(z)
  m <- z_length + 1
  i <- 1:z_length
  sum(i * (m - i) * z)
}
aly_test_statistic <- function(sample1, sample2 = NULL, nperm = 1) {
  # compute the statistic based on one sample only: sample1
  if (is.null(sample2)) {
    sample1 <- sort(sample1)
    z <- diff(sample1)
    return(aly_delta_statistic(z))
  } else {
    # statistic based on randomization of the two samples
    m1 <- length(sample1)
    m2 <- length(sample2)
    # allocate a vector to save the statistic delta
    statistic <- vector(mode = "numeric", length = nperm)
    for (j in 1:nperm) {
      # 1st stage resampling (performed only if the sample sizes are different):
      # the larger sample is resized to the size of the smaller one
      if (m2 > m1) {
        sample2 <- sort(sample(sample2, m1))
        m <- m1
      } else {
        sample1 <- sort(sample(sample1, m2))
        m <- m2
      }
      # z-values: z1 in column 1 and z2 in column 2
      z_two_samples <- matrix(c(diff(sample1), diff(sample2)), ncol = 2)
      # 2nd stage resampling
      z <- apply(z_two_samples, 1, sample, 1)
      statistic[j] <- aly_delta_statistic(z)
    }
    return(statistic)
  }
}
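My tentative idea for the p-value (my own sketch, not from Good's book; aly_p_value is just a name I made up) is a standard Monte Carlo permutation p-value, with the alternative deciding which tail is counted:
aly_p_value <- function(sample1, sample2, nperm = 1000,
                        alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  observed <- aly_test_statistic(sample1)                  # statistic on sample1 alone
  permuted <- aly_test_statistic(sample1, sample2, nperm)  # permutation distribution
  p.greater <- (1 + sum(permuted >= observed)) / (nperm + 1)
  p.less    <- (1 + sum(permuted <= observed)) / (nperm + 1)
  switch(alternative,
         greater   = p.greater,
         less      = p.less,
         two.sided = min(1, 2 * min(p.greater, p.less)))
}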

Estimating the Standard Deviation of a ratio using Taylor expansion

I am interested in building an R function that I can use to test the limits of the Taylor series approximation. I am aware that there are limits to what I am doing, but it's exactly those limits I wish to investigate.
I have two normally distributed random variables x and y. x has a mean of 4 and a standard deviation (sd) of 1. y has a mean of 5 and a sd of 4.
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
I know how to estimate the mean ratio of y/x, like this
# E(y/x) = E(y)/E(x) - Cov(y,x)/E(x)^2 + Var(x)*E(y)/E(x)^3
me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3  # Cov(y,x) = 0; Var(x) = sd.x^2
[1] 1.328125
I am however stuck on how to estimate the Standard Deviation of the ratio? I realize I have to use a Taylor expansion, but not how to use it.
Doing a simple simulation I get
x <- rnorm(10^4, mean = 4, sd = 1); y <- rnorm(10^4, mean = 5, sd = 4)
sd(y/x)
[1] 2.027593
mean(y/x)
[1] 1.362142
There is an analytical expression for the PDF of the ratio of two Gaussians, derived by David Hinkley (e.g. see Wikipedia). So we could compute all moments, means, etc. I typed it in, and the distribution clearly doesn't have a finite second moment, thus it doesn't have a finite standard deviation. Note that I've denoted your Y Gaussian as my X, and your X as my Y (the formulas assume X/Y). I got a mean value of the ratio pretty close to what you got from simulation, but the last integral is infinite, sorry. You could sample more and more values, but the sampled std. dev. keeps growing as well, as noted by @G.Grothendieck
library(ggplot2)
m.x <- 5; s.x <- 4
m.y <- 4; s.y <- 1
a <- function(x) {
  sqrt((x/s.x)^2 + (1.0/s.y)^2)
}
b <- function(x) {
  (m.x*x)/s.x^2 + m.y/s.y^2
}
c <- (m.x/s.x)^2 + (m.y/s.y)^2
d <- function(x) {
  u <- b(x)^2 - c*a(x)^2
  l <- 2.0*a(x)^2
  exp(u / l)
}
# PDF for the ratio of the two different Gaussians
PDF <- function(x) {
  r <- b(x)/a(x)
  q <- pnorm(r) - pnorm(-r)
  (r*d(x)/a(x)^2) * (1.0/(sqrt(2.0*pi)*s.x*s.y)) * q + exp(-0.5*c)/(pi*s.x*s.y*a(x)^2)
}
# normalization
nn <- integrate(PDF, -Inf, Inf)
nn <- nn[["value"]]
# plot PDF
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = function(x) PDF(x)/nn) + xlim(-2.0, 6.0)
print(p)
# first moment
m1 <- integrate(function(x) x*PDF(x), -Inf, Inf)
m1 <- m1[["value"]]
# mean
print(m1/nn)
# some sampling
set.seed(32345)
n <- 10^7L
x <- rnorm(n, mean = m.x, sd = s.x); y <- rnorm(n, mean = m.y, sd = s.y)
print(mean(x/y))
print(sd(x/y))
# second moment - infinite!
m2 <- integrate(function(x) x*x*PDF(x), -Inf, Inf)
Thus, it is impossible to test any Taylor expansion for std.dev.
With the cautions suggested by @G.Grothendieck in mind: a useful mnemonic for products and quotients of independent X and Y variables is
CV^2(X/Y) = CV^2(X*Y) = CV^2(X) + CV^2(Y)
where CV is the coefficient of variation (sd(X)/mean(X)), so CV^2 is Var/mean^2. In other words
Var(Y/X)/(m(Y/X))^2 = Var(X)/m(X)^2 + Var(Y)/m(Y)^2
or rearranging
sd(Y/X) = sqrt[ Var(X)*m(Y/X)^2/m(X)^2 + Var(Y)*m(Y/X)^2/m(Y)^2 ]
For random variables with the mean well away from zero, this is a reasonable approximation.
set.seed(101)
y <- rnorm(1000,mean=5)
x <- rnorm(1000,mean=10)
myx <- mean(y/x)
sqrt(var(x)*myx^2/mean(x)^2 + var(y)*myx^2/mean(y)^2) ## 0.110412
sd(y/x) ## 0.1122373
Using your example is considerably worse because the CV of Y is close to 1; I initially thought it looked OK, but now I see that it's biased as well as not capturing the variability very well. (I'm also plugging in the expected values of the mean and SD rather than their simulated values, but for such a large sample that should be a minor part of the error.)
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
myx <- me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
x <- rnorm(1e4,me.x,sd.x); y <- rnorm(1e4,me.y,sd.y)
c(myx,mean(y/x))
sdyx <- sqrt(sd.x^2*myx^2/me.x^2 + sd.y^2*myx^2/me.y^2)
c(sdyx,sd(y/x))
## 1.113172 1.197855
rvals <- replicate(1000,
                   sd(rnorm(1e4, me.y, sd.y) / rnorm(1e4, me.x, sd.x)))
hist(log(rvals),col="gray",breaks=100)
abline(v=log(sdyx),col="red",lwd=2)
min(rvals) ## 1.182698
All the canned delta-method approaches to computing the variance of Y/X use the point estimate for Y/X (i.e. m(Y/X) = mY/mX), rather than the second-order approximation you used above. Constructing higher-order forms for both the mean and the variance should be straightforward if possibly tedious (a computer algebra system might help ...)
mvec <- c(x = me.x, y = me.y)
V <- diag(c(sd.x, sd.y)^2)
car::deltaMethod(mvec, "y/x", V)
## Estimate SE
## y/x 1.25 1.047691
library(emdbook)
sqrt(deltavar(y/x,meanval=mvec,Sigma=V)) ## 1.047691
sqrt(sd.x^2*(me.y/me.x)^2/me.x^2 + sd.y^2*(me.y/me.x)^2/me.y^2) ## 1.047691
For what it's worth, I took the code in @SeverinPappadeux's answer and made it into a function gratio(mx,my,sx,sy). For the Cauchy case (gratio(0,0,1,1)) it gets confused and reports a mean of 0 (which should be NA/divergent) but correctly reports the variance/std dev as divergent. For the parameters specified by the OP (gratio(5,4,4,1)) it gives mean=1.352176, sd=NA as above. For the first parameters I tried above (gratio(10,5,1,1)) it gives mean=0.5051581, sd=0.1141726.
These numerical experiments strongly suggest to me that the ratio of Gaussians sometimes has a well-defined variance, but I don't know when (time for another question on Math StackOverflow or CrossValidated?)
Such approximations are unlikely to be useful since the distribution may not have a finite standard deviation. Look at how unstable it is:
set.seed(123)
n <- 10^6
X <- rnorm(n, me.x, sd.x)
Y <- rnorm(n, me.y, sd.y)
sd(head(Y/X, 10^3))
## [1] 1.151261
sd(head(Y/X, 10^4))
## [1] 1.298028
sd(head(Y/X, 10^5))
## [1] 1.527188
sd(Y/X)
## [1] 1.863168
Contrast that with what happens when we try the same thing with a normal random variable:
sd(head(Y, 10^3))
## [1] 3.928038
sd(head(Y, 10^4))
## [1] 3.986802
sd(head(Y, 10^5))
## [1] 3.984113
sd(Y)
## [1] 3.999024
Note: If you were in a different situation, e.g. the denominator has compact support, then you could do this:
library(car)
m <- c(x = me.x, y = me.y)
v <- diag(c(sd.x, sd.y)^2)
deltaMethod(m, "y/x", v)
