CLT - Central Limit Theorem, how do I use it in R?

I've learned the CLT, and I have a question.
We take the average of 100 observations, where the expectation is M and the variance is 9.
I need to find bounds a and b (lower and upper) such that the probability is at least 0.9:
P( a <= X(100) - M <= b ) >= 0.9
where X(100) denotes the sample mean of the 100 observations.
How do I do this in R? I can't figure it out. I wrote something like this -
numt <- 1:100
cbind(numt , 1-2*pnorm(-b/3))
but I understand it doesn't work well.

I am not sure whether the following is what the question asks for.
The following function takes as input
M - the expectation,
Var - the variance,
n - the sample size,
prob - the two-sided probability.
Arguments Var, n and prob default to the values given in the question, so this can be seen as a function of M only.
fun <- function(M, Var = 9, n = 100, prob = 0.90){
  p <- c((1 - prob)/2, 1 - (1 - prob)/2)
  qq <- qnorm(p, mean = M, sd = sqrt(Var/n))
  setNames(qq, c("lower", "upper"))
}
M <- 1
fun(M)
#     lower     upper
# 0.5065439 1.4934561
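Side note: if the question is read literally as bounds on the centered quantity, P(a <= X(100) - M <= b) >= 0.9, then under the CLT normal approximation (taking the shortest symmetric interval) a and b do not depend on M at all, since X(100) - M is approximately N(0, Var/n). A minimal sketch of that reading:
ab <- qnorm(c(0.05, 0.95), mean = 0, sd = sqrt(9/100))
setNames(ab, c("a", "b"))
#          a          b
# -0.4934561  0.4934561
These are the same bounds as fun(M) above, just shifted by M.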


In R, given y = f(x), how do I find the global max of y/x?

I have created a curve that shows for each level of spend (X) a unique output in revenue (Y).
The curve is defined by the following (monotone) function:
calculate_abc_revenue <- function(a, b, c, spend) {
  res <- ifelse(
    a/(1+b*(spend)^c) >= 0,
    a/(1+b*(spend)^c),
    0
  )
  return(res)
}
Where a, b and c are given parameters and should be treated as constant:
a0 <- 1303066.36937866
b0 <- 15560519.9999999
c0 <- -1.09001859302511
Now, if we define ROI as:
revenue <- calculate_abc_revenue(a = a0, b = b0, c = c0, spend)
ROI <- revenue/spend
How do I find the exact values of revenue and spend that make ROI max?
I currently use a spend vector of length n to find the maximum ROI approximately, but most of the time the result is not exact, as the real maximum ROI can fall between two of the input points.
I would like to avoid increasing the length of the spend vector, as that would increase the calculation time (and it wouldn't guarantee that the solution found is a global maximum anyway).
By setting ROI as a function, we can use optimize:
ROI <- function(spend) calculate_abc_revenue(a0, b0, c0, spend)/spend
optspend <- optimize(ROI, c(0, 1e12), maximum = TRUE)$maximum
c("optimal spend" = optspend, revenue = calculate_abc_revenue(a0, b0, c0, optspend))
#> optimal spend revenue
#> 435274.1 107613.0
Here is another approach. It applies uniroot to the derivative of the ROI function f (only valid for spend > 0).
The derivative is built from f and turned into a function, which is then fed into uniroot.
Plots in the upper row show ROI and its derivative over a wide range of investment; the lower row shows a zoomed-in version of both.
#| ROI as a function of investment (spend)
f <- expression(1 / spend * (a0 / (1 + b0 * spend^c0)))
#| get the derivative
ff <- D(f, 'spend')
#| turn the derivative into a function and find its root with uniroot
ff_2 <- function(spend) {}
body(ff_2) <- ff
res <- uniroot(ff_2, c(1e5, 1e6))
c("optimal spend" = res$root, revenue = calculate_abc_revenue(spend = res$root))
# optimal spend revenue
# 435274.1 107613.0
Original Data
a0 <- 1303066.36937866
b0 <- 15560519.9999999
c0 <- -1.09001859302511
#| Original function
calculate_abc_revenue <- function(a = a0, b = b0, c = c0, spend) {
  res <- ifelse(
    a/(1+b*(spend)^c) >= 0,
    a/(1+b*(spend)^c),
    0
  )
  return(res)
}
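As a quick cross-check (my addition, not part of the original answer), the derivative root found by uniroot should agree with the optimize() maximum from the other answer, using the defaulted calculate_abc_revenue() defined above:
opt <- optimize(function(s) calculate_abc_revenue(spend = s) / s,
                interval = c(1, 1e8), maximum = TRUE)
c(uniroot = res$root, optimize = opt$maximum)
# both should come out at roughly 435274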

one sample hypothesis test for proportions

I'm looking for a built-in R function that calculates the power of a one-sample hypothesis test for proportions.
The built-in function power.prop.test only does TWO-SAMPLE hypothesis tests for proportions.
The original question is: "How many times do you have to toss a coin to determine that it is biased?"
p.null <- 0.5 # null hypothesis
# We say that a coin is "biased" if the probability of tossing heads is either
# greater than 0.51 or less than 0.49. Otherwise we say that it is "good enough".
delta <- 0.01
Here is a function to toss a biased coin N times and return the proportion of heads:
biased.coin <- function(delta, N) {
  probs <- runif(N, 0, 1)
  heads <- probs[probs < 0.5+delta]
  return(length(heads)/N)
}
We fix alpha and beta throughout at the standard values. Our goal is to calculate N.
alpha = 0.05 # 95% confidence interval
beta = 0.8 # Correctly reject the null hypothesis 80% of the time (i.e. the target power).
The first step is to use a simulation.
A single experiment is to toss the coin N times and reject the null hypothesis if the number of heads deviates "too far" from the expected value of N/2.
We then repeat the experiment M times and count how many times the null hypothesis is (correctly) rejected.
M <- 1000
simulate.power <- function(delta, N, p.null, M, alpha) {
  print(paste("Calculating power for N =", N))
  reject <- c()
  se <- sqrt(p.null*(1-p.null))/sqrt(N)
  for (i in (1:M)) {
    heads <- biased.coin(delta, N)  # perform an experiment
    z <- (heads - p.null)/se        # z-score
    p.value <- pnorm(-abs(z))       # one-sided p-value
    reject[i] <- p.value < alpha/2  # do we reject the null?
  }
  return(sum(reject)/M)  # proportion of times the null was rejected
}
Next we plot a graph (slow, about 5 minutes):
ns <- seq(1000, 50000, by=1000)
my.pwr <- c()
for (i in (1:length(ns))) {
my.pwr[i] <- simulate.power(delta, ns[i], p.null, M, alpha)
}
plot(ns, my.pwr)
From the graph it looks like the N you need for a power of beta = 0.8 is about 20000.
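If you prefer to read that off programmatically rather than from the plot, a one-liner over the simulated values does it (my addition, assuming the ns and my.pwr objects above):
ns[which(my.pwr >= 0.8)[1]]  # first N in the grid whose simulated power reaches 0.8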
The simulation is very slow so it would be nice to have a built in function.
A little fiddling around gave me this:
magic <- function(p.null, delta, alpha, N) {
  magic <- power.prop.test(p1 = p.null,
                           p2 = p.null + delta,
                           sig.level = alpha,
                           ###################################
                           n = 2*N, # mysterious 2
                           ###################################
                           alternative = "two.sided",
                           strict = FALSE)
  return(magic[["power"]])
}
Let's plot it against our simulated data.
pwr.magic <- c()
for (i in (1:length(ns))) {
pwr.magic[i] <- magic(p.null, delta, alpha, ns[i])
}
points(ns, pwr.magic, pch=20)
The fit is good, but I have no idea why I need to multiply N by two
in order to get a one-sample power out of a two-sample proportion test.
It would be nice if there were a built in function that let you do one sample directly.
Thanks!
You could try
library(pwr)
h <- ES.h(0.51, 0.5) # Compute effect size h for two proportions
pwr.p.test(h = h, n = NULL, sig.level = 0.05, power = 0.8, alternative = "two.sided")
# proportion power calculation for binomial distribution (arcsine transformation)
# h = 0.02000133
# n = 19619.53
# sig.level = 0.05
# power = 0.8
# alternative = two.sided
As an aside, one way to speed up your simulation significantly would be to use rbinom instead of runif:
biased.coin2 <- function(delta, N) {
  rbinom(1, N, 0.5 + delta) / N
}
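As for the "mysterious 2": my reading (not from the original answer) is that with equal group sizes n the two-sample standard error is roughly sqrt(2*p*(1-p)/n), so calling power.prop.test with n = 2*N reproduces the one-sample standard error sqrt(p*(1-p)/N). A quick numerical check under that assumption:
library(pwr)
N <- 20000
power.prop.test(p1 = 0.5, p2 = 0.51, sig.level = 0.05, n = 2*N)$power
pwr.p.test(h = ES.h(0.51, 0.5), n = N, sig.level = 0.05)$power
# both come out close to 0.8, consistent with the simulation above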

How does ar.yw estimate the variance

In R, how does the function ar.yw estimate the variance? Specifically, where does the number "var.pred" come from? It does not seem to come from the usual YW estimate of the variance, nor the sum of squared residuals divided by df (even though there is disagreement about what the df should be, none of the choices give an answer equivalent to var.pred). And yes, I know that there are better methods than YW; just trying to figure out what R is doing.
set.seed(82346)
temp <- arima.sim(n=10, list(ar = 0.5), sd=1)
fit <- ar(temp, method = "yule-walker", demean = FALSE, aic=FALSE, order.max=1)
## R's estimate of the sigma squared
fit$var.pred
## YW estimate
sum(temp^2)/10 - fit$ar*sum(temp[2:10]*temp[1:9])/10
## YW if there was a mean
sum((temp-mean(temp))^2)/10 - fit$ar*sum((temp[2:10]-mean(temp))*(temp[1:9]-mean(temp)))/10
## estimate based on residuals, different possible df.
sum(na.omit(fit$resid^2))/10
sum(na.omit(fit$resid^2))/9
sum(na.omit(fit$resid^2))/8
sum(na.omit(fit$resid^2))/7
Need to read the code if it's not documented.
?ar.yw
Which says: "In ar.yw the variance matrix of the innovations is computed from the fitted coefficients and the autocovariance of x." If that is not enough explanation, then you need to look at the code:
methods(ar.yw)
#[1] ar.yw.default* ar.yw.mts*
#see '?methods' for accessing help and source code
getAnywhere(ar.yw.default)
# there are two cases that I see
x <- as.matrix(x)
nser <- ncol(x)
if (nser > 1L) # .... not your situation
#....
else {
    r <- as.double(drop(xacf))
    z <- .Fortran(C_eureka, as.integer(order.max), r, r,
                  coefs = double(order.max^2), vars = double(order.max),
                  double(order.max))
    coefs <- matrix(z$coefs, order.max, order.max)
    partialacf <- array(diag(coefs), dim = c(order.max, 1L, 1L))
    var.pred <- c(r[1L], z$vars)
    #.......
    order <- if (aic)
        (0L:order.max)[xaic == 0L]
    else order.max
    ar <- if (order)
        coefs[order, seq_len(order)]
    else numeric()
    var.pred <- var.pred[order + 1L]
    var.pred <- var.pred * n.used/(n.used - (order + 1L))
So you now need to find the Fortran code for C_eureka. I think I'm finding it here: https://svn.r-project.org/R/trunk/src/library/stats/src/eureka.f This is the code that I think is returning the var.pred estimate. I'm not a time series guy, and it's your responsibility to review this process for applicability to your problem.
subroutine eureka (lr,r,g,f,var,a)
c
c solves Toeplitz matrix equation toep(r)f=g(1+.)
c by Levinson's algorithm
c a is a workspace of size lr, the number
c of equations
c
snipped
c estimate the innovations variance
var(l) = var(l-1) * (1 - f(l,l)*f(l,l))
if (l .eq. lr) return
d = 0.0d0
q = 0.0d0
do 50 i = 1, l
k = l-i+2
d = d + a(i)*r(k)
q = q + f(l,i)*r(k)
50 continue
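Putting the pieces together for the AR(1) example at the top (my own sketch, based on the code paths quoted above, assuming demean = FALSE and order 1): the Levinson recursion returns the Yule-Walker innovation variance c0*(1 - phi^2), and ar.yw.default then multiplies it by n/(n - (order + 1)). That finite-sample factor is exactly what separates var.pred from the "usual YW estimate" in the question.
n   <- length(temp)
c0  <- sum(temp * temp) / n          # sample autocovariance at lag 0 (no demeaning)
c1  <- sum(temp[-1] * temp[-n]) / n  # sample autocovariance at lag 1
phi <- c1 / c0                       # Yule-Walker AR(1) coefficient, equals fit$ar
c(var.pred = fit$var.pred,
  reproduced = c0 * (1 - phi^2) * n / (n - 2))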

Portfolio optimization

I am trying to build a portfolio which is optimized with respect to another in R.
I am trying to minimize the objective function
$$\min_w\ \operatorname{Var}(return_p - return'\,weight_{bm})$$
with the constraints
$$ 1_n'w = 1$$
$$w > .005$$
$$w < .8$$
with w being the portfolio weights. There are 10 securities, so I set the benchmark weights at .1 each.
I know that
$$\operatorname{Var}(return_p - return'\,weight_{bm}) = \operatorname{Var}(r'w) + \operatorname{Var}(r'w_{bm}) - 2\operatorname{Cov}(r'w,\ r'w_{bm})$$
and, dropping the middle term because it does not depend on $w$, the quantity to minimize is
$$\operatorname{Var}(r'w) - 2\operatorname{Cov}(r'w,\ r'w_{bm}) = w'\operatorname{Var}(r)\,w - 2\operatorname{Cov}(r',\ r'w_{bm})\,w$$
The last expression is in the form I need, so I tried to solve this with solve.QP in R; the constraints are giving me a problem, though.
Here is my code:
library(quadprog)

obs <- 100    # number of return observations (not specified in the question)
assets <- 10  # 10 securities, as stated above

trackport <- array(rnorm(obs * assets, mean = .2, sd = .15),
                   dim = c(obs, assets))  # the portfolio the assets are tracked against
wbm <- matrix(rep(1/assets, assets))      # benchmark weights
Aeq <- t(matrix(rep(1, assets), nrow = assets, ncol = 1))  # row of 1's to sum the weights
Beq <- 1                                  # weights should sum to 1
H <- 2*cov(trackport)                     # times 2 because of the solve.QP syntax
# multiply the returns by the benchmark weights to get the benchmark return series
rbm <- trackport %*% wbm
# covariance between the tracking portfolio and benchmark returns
eff <- cov(trackport, rbm)
# constraints
Amatrix <- t(matrix(c(Aeq, diag(assets), -diag(assets)), ncol = assets,
                    byrow = TRUE))
Bvector <- matrix(c(1, rep(.005, assets), rep(.8, assets)))
# solve
zeros <- rep(0, assets)  # zero linear term: reduces to the min variance portfolio,
                         # for troubleshooting purposes
solQP3 <- solve.QP(Dmat = H,
                   dvec = zeros,
                   Amat = Amatrix,
                   bvec = Bvector,
                   meq = 1)
The error I am getting is "constraints are inconsistent, no solution!", but I can't find what's wrong with my A matrix.
My (transposed) A matrix looks like this
[1,1,...,1]
[1,0,...,0]
[0,1,...,0]
...
[0,0,...,1]
[-1,0,...,0]
[0,-1,...,0]
...
[0,0,...,-1]
and my $b_0$ looks like this
[1]
[.005]
[.005]
...
[.005]
[.8]
[.8]
...
[.8]
so I'm not sure why it isn't finding a solution, could anyone take a look?
I'm not familiar with the package, but just took a quick look at https://cran.r-project.org/web/packages/quadprog/quadprog.pdf , which apparently is what you are using.
Your RHS values of .8 should be -0.8 because this function uses ≥ inequalities. So you have been constraining the variables to be ≥ .005 and ≤ -0.8, which of course is not what you want, and is infeasible.
So leave transposed A as is and make
b0:
[1]
[.005]
[.005]
...
[.005]
[-.8]
[-.8]
...
[-.8]
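To make the fix concrete, here is a sketch of the corrected call, reusing the objects from the question (the zero dvec again reduces it to the minimum-variance problem):
library(quadprog)
Bvector <- matrix(c(1, rep(.005, assets), rep(-.8, assets)))  # note the sign flip on .8
solQP3 <- solve.QP(Dmat = H,
                   dvec = rep(0, assets),
                   Amat = Amatrix,
                   bvec = Bvector,
                   meq = 1)
round(solQP3$solution, 4)  # weights now lie between .005 and .8 and sum to 1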

Estimating the Standard Deviation of a ratio using Taylor expansion

I am interested in building an R function that I can use to test the limits of the Taylor series approximation. I am aware that there are limits to what I am doing, but it's exactly those limits I wish to investigate.
I have two normally distributed random variables x and y. x has a mean of 4 and a standard deviation (sd) of 1. y has a mean of 5 and a sd of 4.
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
I know how to estimate the mean ratio of y/x, like this
# E(y/x) = E(y)/E(x) - Cov(y,x)/E(x)^2 + Var(x)*E(y)/E(x)^3
me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
[1] 1.328125
I am, however, stuck on how to estimate the standard deviation of the ratio. I realize I have to use a Taylor expansion, but I don't know how to apply it.
Doing a simple simulation I get
x <- rnorm(10^4, mean = 4, sd = 1); y <- rnorm(10^4, mean = 5, sd = 4)
sd(y/x)
[1] 2.027593
mean(y/x)
[1] 1.362142
There is an analytical expression for the PDF of the ratio of two Gaussians, derived by David Hinkley (e.g. see Wikipedia), so we could compute all moments, means, etc.
I typed it in, and apparently the ratio clearly doesn't have a finite second moment, thus it doesn't have a finite standard deviation. Note that I've denoted your Y Gaussian as my X, and your X as my Y (the formulas assume X/Y). I got a mean value of the ratio pretty close to what you got from the simulation, but the last integral is infinite, sorry. You could sample more and more values, but the sampled std. dev. grows as well, as noted by @G.Grothendieck.
library(ggplot2)
m.x <- 5; s.x <- 4
m.y <- 4; s.y <- 1
a <- function(x) {
  sqrt( (x/s.x)^2 + (1.0/s.y)^2 )
}
b <- function(x) {
  (m.x*x)/s.x^2 + m.y/s.y^2
}
c <- (m.x/s.x)^2 + (m.y/s.y)^2
d <- function(x) {
  u <- b(x)^2 - c*a(x)^2
  l <- 2.0*a(x)^2
  exp( u / l )
}
# PDF for the ratio of the two different Gaussians
PDF <- function(x) {
  r <- b(x)/a(x)
  q <- pnorm(r) - pnorm(-r)
  (r*d(x)/a(x)^2) * (1.0/(sqrt(2.0*pi)*s.x*s.y)) * q + exp(-0.5*c)/(pi*s.x*s.y*a(x)^2)
}
# normalization
nn <- integrate(PDF, -Inf, Inf)
nn <- nn[["value"]]
# plot PDF
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = function(x) PDF(x)/nn) + xlim(-2.0, 6.0)
print(p)
# first moment
m1 <- integrate(function(x) x*PDF(x), -Inf, Inf)
m1 <- m1[["value"]]
# mean
print(m1/nn)
# some sampling
set.seed(32345)
n <- 10^7L
x <- rnorm(n, mean = m.x, sd = s.x); y <- rnorm(n, mean = m.y, sd = s.y)
print(mean(x/y))
print(sd(x/y))
# second moment - infinite!
m2 <- integrate(function(x) x*x*PDF(x), -Inf, Inf)
Thus, it is impossible to test any Taylor expansion for std.dev.
With the cautions suggested by @G.Grothendieck in mind: a useful mnemonic for products and quotients of independent X and Y variables is
CV^2(X/Y) = CV^2(X*Y) = CV^2(X) + CV^2(Y)
where CV is the coefficient of variation (sd(X)/mean(X)), so CV^2 is Var/mean^2. In other words
Var(Y/X)/(m(Y/X))^2 = Var(X)/m(X)^2 + Var(Y)/m(Y)^2
or rearranging
sd(Y/X) = sqrt[ Var(X)*m(Y/X)^2/m(X)^2 + Var(Y)*m(Y/X)^2/m(Y)^2 ]
For random variables with the mean well away from zero, this is a reasonable approximation.
set.seed(101)
y <- rnorm(1000,mean=5)
x <- rnorm(1000,mean=10)
myx <- mean(y/x)
sqrt(var(x)*myx^2/mean(x)^2 + var(y)*myx^2/mean(y)^2) ## 0.110412
sd(y/x) ## 0.1122373
Using your example values works considerably worse because the CV of Y is close to 1. I initially thought the approximation looked OK, but now I see that it's biased as well as not capturing the variability very well. (I'm also plugging in the expected values of the mean and SD rather than their simulated values, but for such a large sample that should be a minor part of the error.)
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
myx <- me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
x <- rnorm(1e4,me.x,sd.x); y <- rnorm(1e4,me.y,sd.y)
c(myx,mean(y/x))
sdyx <- sqrt(sd.x^2*myx^2/me.x^2 + sd.y^2*myx^2/me.y^2)
c(sdyx,sd(y/x))
## 1.113172 1.197855
rvals <- replicate(1000,
sd(rnorm(1e4,me.y,sd.y)/rnorm(1e4,me.x,sd.x)))
hist(log(rvals),col="gray",breaks=100)
abline(v=log(sdyx),col="red",lwd=2)
min(rvals) ## 1.182698
All the canned delta-method approaches to computing the variance of Y/X use the point estimate for Y/X (i.e. m(Y/X) = mY/mX), rather than the second-order approximation you used above. Constructing higher-order forms for both the mean and the variance should be straightforward if possibly tedious (a computer algebra system might help ...)
mvec <- c(x = me.x, y = me.y)
V <- diag(c(sd.x, sd.y)^2)
car::deltaMethod(mvec, "y/x", V)
## Estimate SE
## y/x 1.25 1.047691
library(emdbook)
sqrt(deltavar(y/x,meanval=mvec,Sigma=V)) ## 1.047691
sqrt(sd.x^2*(me.y/me.x)^2/me.x^2 + sd.y^2*(me.y/me.x)^2/me.y^2) ## 1.047691
For what it's worth, I took the code in @SeverinPappadeux's answer and made it into a function gratio(mx,my,sx,sy). For the Cauchy case (gratio(0,0,1,1)) it gets confused and reports a mean of 0 (which should be NA/divergent) but correctly reports the variance/std dev as divergent. For the parameters specified by the OP (gratio(5,4,4,1)) it gives mean=1.352176, sd=NA as above. For the first parameters I tried above (gratio(10,5,1,1)) it gives mean=0.5051581, sd=0.1141726.
These numerical experiments strongly suggest to me that the ratio of Gaussians sometimes has a well-defined variance, but I don't know when (time for another question on Math StackOverflow or CrossValidated?)
Such approximations are unlikely to be useful since the distribution may not have a finite standard deviation. Look at how unstable it is:
set.seed(123)
n <- 10^6
X <- rnorm(n, me.x, sd.x)
Y <- rnorm(n, me.y, sd.y)
sd(head(Y/X, 10^3))
## [1] 1.151261
sd(head(Y/X, 10^4))
## [1] 1.298028
sd(head(Y/X, 10^5))
## [1] 1.527188
sd(Y/X)
## [1] 1.863168
Contrast that with what happens when we try the same thing with a normal random variable:
sd(head(Y, 10^3))
## [1] 3.928038
sd(head(Y, 10^4))
## [1] 3.986802
sd(head(Y, 10^5))
## [1] 3.984113
sd(Y)
## [1] 3.999024
Note: If you were in a different situation, e.g. the denominator has compact support, then you could do this:
library(car)
m <- c(x = me.x, y = me.y)
v <- diag(c(sd.x, sd.y)^2)
deltaMethod(m, "y/x", v)
