Standard Normal Quantile Function Integration in R

I need to compute a ratio of integrals, where the function q_alpha(z) is the quantile function of a standard normal distribution.
I have a question regarding the denominator. Since the standard normal distribution is symmetric, continuous, etc., is integrating the denominator term simple? Do I just need to square each quantile of this function and then proceed with the calculation?
This is my code in R:
library(Bolstad)
thau=1:99/100
z.standard.quantile=qnorm(thau,0,1)
z.standard.quantile.square=qnorm(thau,0,1)^2
sintegral(thau[1:50],z.standard.quantile[1:50])$value/sintegral(thau[1:50], z.standard.quantile.square[1:50])$value
The result is: -0.8676396

There is no problem in taking the square of qnorm, but qnorm is unbounded near 0 (note that qnorm(0) is -Inf), so the integrals are improper and cannot be evaluated with a lower limit of exactly 0.
My second thought is that there is actually no need to use Bolstad::sintegral (Simpson's rule); the R base function integrate is sufficient. Or, we can discretize qnorm and use the Trapezoidal rule, because qnorm is a smooth function which can be well approximated by linear interpolation.
I will write a function evaluating the ratio of integrals in your question, but with the lower bound at l:
## using `integrate`
f1 <- function (l) {
  a <- integrate(qnorm, lower = l, upper = 0.5)$value
  b <- integrate(function (x) qnorm(x) ^ 2, lower = l, upper = 0.5)$value
  a / b
}
## using the Trapezoidal rule, with `n` divisions of the interval `[l, 0.5]`
f2 <- function (l, n) {
  x <- seq(l, 0.5, length = n)
  delta <- x[2] - x[1]
  y1 <- qnorm(x)
  y2 <- y1 ^ 2
  a <- sum(y1[-1] + y1[-n]) / 2 * delta
  b <- sum(y2[-1] + y2[-n]) / 2 * delta
  a / b
}
These two functions return rather similar results, as we can test:
f1 (0.1)
# [1] -1.276167
f2 (0.1, 1000)
# [1] -1.276166
Now, the only thing of interest is the limiting behaviour when l -> 0 (in a numerical sense). Let's try
l <- 10 ^ (- (1:16))
# [1] 1e-01 1e-02 1e-03 1e-04 1e-05 1e-06 1e-07 1e-08 1e-09 1e-10 1e-11 1e-12
# [13] 1e-13 1e-14 1e-15 1e-16
y1 <- sapply(l, f1)
# [1] -1.2761674 -0.8698411 -0.8096179 -0.7996069 -0.7981338 -0.7979341
# [7] -0.7978877 -0.7978848 -0.7978846 -0.7978846 -0.7978846 -0.7978846
# [13] -0.7978846 -0.7978846 -0.7978846 -0.7978846
## quite a dense grid; takes some time to compute
y2 <- sapply(l, f2, n = 1e+6)
# [1] -1.2761674 -0.8698411 -0.8096179 -0.7996071 -0.7981158 -0.7979137
# [7] -0.7978877 -0.7978834 -0.7978816 -0.7978799 -0.7978783 -0.7978767
# [13] -0.7978750 -0.7978734 -0.7978717 -0.7978700
Now, it looks like there is a limit toward around -0.7978 as l -> 0.
Note, the -0.8676396 you got is actually about f1(0.01) or f2(0.01, 1e+6).
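For reference (my addition, not part of the answer above), both improper integrals have closed forms, so the limit can be checked analytically. Substituting t = pnorm(z), the integral of qnorm(t) on (0, 0.5] equals -dnorm(0), and the integral of qnorm(t)^2 on (0, 0.5] equals 0.5 (half the variance of N(0, 1)), so the limit of the ratio should be:
-dnorm(0) / 0.5
# [1] -0.7978846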

Related

Differentiating pnorm() in R to show that the PDF of a continuous variable is the derivative of the CDF

I need to demonstrate that the probability density function is the derivative of the CDF. Any distribution will do, but I have been trying with the normal. I have got as far as:
set.seed(53)
b <- rnorm(500)
db <- density(b)
plot(db)
Then I can calculate the cumulative probabilities using pnorm(b), but then I don't know how to differentiate, because D() requires an expression rather than pnorm(). Could anyone help, please?
Here's the console scrape from where I demonstrated the near equality (to 5 or 7 decimal places) of the integral of dnorm to pnorm from -Inf to selected values of "x": The Fundamental Theorem of Calculus says that if the integral of a function f(x) is g(x) then f(x) is the derivative of g(x). (Or words to that effect.)
> sapply(c(0,Inf), function(x) integrate(dnorm, lower=-Inf, upper=x))
             [,1]         [,2]
value        0.5          1
abs.error    4.680562e-05 9.361124e-05
subdivisions 3            3
message      "OK"         "OK"
call         expression   expression
> sapply(c(0,Inf), function(x) integrate(dnorm, lower=-Inf, upper=x)$value)
[1] 0.5 1.0
> sapply(seq(-3,3, by=0.5), function(x) integrate(dnorm, lower=-Inf, upper=x)$value)
[1] 0.001349899 0.006209665 0.022750132 0.066807201 0.158655254 0.308537539
[7] 0.500000000 0.691462461 0.841344751 0.933192799 0.977249868 0.993790335
[13] 0.998650102
> pnorm(seq(-3,3, by=0.5))
[1] 0.001349898 0.006209665 0.022750132 0.066807201 0.158655254 0.308537539
[7] 0.500000000 0.691462461 0.841344746 0.933192799 0.977249868 0.993790335
[13] 0.998650102
I wasn't sure that D() was "smart" enough to do the symbolic differentiation, but I shouldn't have been so skeptical. This bit of console interaction was done following the examples on the ?deriv help page:
> D(quote(pnorm(x)), "x")
dnorm(x)
Also ... here's something you can get with deriv:
> norm.expr <- expression(pnorm(x))
> deriv(norm.expr, "x")
expression({
    .value <- pnorm(x)
    .grad <- array(0, c(length(.value), 1L), list(NULL, c("x")))
    .grad[, "x"] <- dnorm(x)
    attr(.value, "gradient") <- .grad
    .value
})
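As a further sanity check (my own addition, not part of the console scrape above), a central finite difference of pnorm() lands on dnorm() to within numerical error:
x <- seq(-3, 3, by = 0.5)
h <- 1e-6
fd <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)  # numerical derivative of the CDF
max(abs(fd - dnorm(x)))                        # tiny, roughly on the order of 1e-10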

Minimising area under the ROC curve to optimise the parameters of a polynomial predictor with optim

My predictor (x) has a U-shaped distribution in relation to the binary outcome (y), with positive outcomes at both low and high values of x, leading to a biconcave ROC curve with a poor area under the curve (AUC).
To maximise its ability to discriminate the outcome, I am trying to optimise the parameters of a second-degree polynomial of x, using optim and 1 - AUC as the cost function to minimise.
x = c(13,7,7,7,1,100,3,4,4,2,2,7,14,8,3,14,5,12,8,
13,9,4,9,4,8,3,13,9,4,4,5,9,10,10,7,6,12,7,2,
6,6,4,3,2,3,10,5,2,5,8,3,5,4,2,7,5,7,6,79,9)
y = c(0,0,1,0,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0)
library(pROC)  # roc() comes from the pROC package
theta = c(0, 0, 0)
min_auc <- function(theta, x, y) {
  (1 - roc(y, (theta[1] + theta[2]*x + theta[3]*x^2))$auc)
}
optim(theta, min_auc, x = x, y = y)
The results are as follows:
$par
[1] 0.0 0.1 0.0
$value
[1] 0.4380054
$counts
function gradient
8 NA
$convergence
[1] 0
$message
NULL
However, from a manual definition of the parameters, I know that min_auc can be further minimised.
theta = c(0, -40, 1)
(1 - roc(y, (theta[1] + theta[2]*x + theta[3]*(x^2)))$auc)
[1] 0.2762803
Could anyone explain to me what I am doing wrong, please? Is it possibly due to a non-convex cost function?
One possibility is that it's a collinearity problem. Scaling the inputs helps:
min_auc <- function(theta, x, y) {
  (1 - roc(y, (theta[1] + theta[2]*scale(x) + theta[3]*scale(x)^2))$auc)
}
optim(theta, min_auc, x = x, y = y)
# $par
# [1] -0.02469136 -0.03117284 0.11049383
#
# $value
# [1] 0.2762803
#
# $counts
# function gradient
# 30 NA
#
# $convergence
# [1] 0
#
# $message
# NULL
#
Another potential problem is that the surface over which you're optimizing has some flat spots. Let's say, for example, that we fix the intercept in this equation to -2. This is about what you get if you do qlogis(mean(y)). Then, you're only optimizing over 2 parameters so the surface is easier to see. Here's what it looks like with the two remaining theta terms on the two horizontal axes and the 1-AUC value on the y-axis.
min_auc <- function(theta, x, y) {
  (1 - roc(y, (-2 + theta[1]*scale(x) + theta[2]*scale(x)^2))$auc)
}
s <- seq(-.25, .25, length=50)
o <- outer(s, s, Vectorize(function(z,w)min_auc(c(z,w),x,y)))
library(plotly)
plot_ly(x = ~s, y = ~s, z = ~o) %>% add_surface()
As you may have noticed above, there is no unique solution to the problem. There are lots of solutions that seem to get to the minimum value.
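A further thought (my addition, not from the answer above): because 1 - AUC depends only on the ranks of the fitted predictor values, the surface is piecewise constant, so derivative-free searches such as Nelder-Mead can stall on a plateau. One pragmatic workaround is to restart optim from several random starting values (shown here for the two-parameter min_auc defined just above) and keep the best run:
set.seed(42)
starts <- replicate(20, rnorm(2, sd = 0.25), simplify = FALSE)
fits <- lapply(starts, function(p) optim(p, min_auc, x = x, y = y))
best <- fits[[which.min(sapply(fits, `[[`, "value"))]]
best$value  # typically close to the 0.2762803 reached with the scaled inputs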

Is it possible to solve equations in R that are not linear?

I want to build a function that takes E[X] and Var[X] and gives me back the underlying parameters (mu and theta) of a univariate lognormal variable, where
E[X] = exp(mu + theta)
Var[X] = exp(2*mu + theta)*(exp(theta) - 1)
The function would take E[X] and Var[X] as input and return theta and mu as output.
There are several packages that provide ways and means to solve a system of nonlinear equations. One of these is nleqslv.
You need to provide a function that returns the differences between the actual values of the equations and the desired values.
Load package nleqslv and define the following function
library(nleqslv)
f <- function(x, Ex, Varx) {
  y <- numeric(length(x))
  mu <- x[1]
  theta <- x[2]
  y[1] <- exp(mu + theta) - Ex
  y[2] <- exp(2*mu + theta) * (exp(theta) - 1) - Varx
  y
}
The vector x in the function contains the values of mu and theta.
An example with Ex=2 and Varx=3 and some random starting values
xstart <- c(1,1)
nleqslv(xstart,f,Ex=2,Varx=3)
gives the following
$x
[1] -0.6931472 1.3862944
$fvec
[1] -8.095125e-11 -8.111645e-11
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 31
$njcnt
[1] 2
$iter
[1] 22
See the manual of nleqslv for the meaning of the different elements of the return value of nleqslv.
If you want to investigate the effect of the different solving methods try this
testnslv(xstart,f,Ex=2,Varx=3)
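A quick sanity check (my addition, not part of the answer): plugging the solution back into the two moment equations recovers the requested Ex and Varx, since mu = log(1/2) and theta = log(4):
mu <- -0.6931472; theta <- 1.3862944   # the $x component returned above
exp(mu + theta)                        # 2  (= Ex)
exp(2*mu + theta) * (exp(theta) - 1)   # 3  (= Varx)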

Problems with modeling AR(1) with three different techniques in R: 1. the formula, 2. a for loop, 3. arima.sim

(reproducible example added)
I tried to model the AR(1) process $Y_t = 0.6\,Y_{t-1} + V_t$ (with $V_t \sim N(0,1)$) using three different techniques:
1. The formula $Y_t = \rho^t \left[\, y_1/\rho + \sum_{j=2}^{t} (\alpha + V_j)/\rho^j \,\right]$, obtained from $Y_t = \alpha + \rho Y_{t-1} + V_t$ by back substitution and implemented with a cumulative sum.
2. The classical for loop
3. arima.sim simulation.
While I expect to get more or less the same things, interesting and unexpected things happened:
N <- 1388; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
# The classical "for" loop
y2 <- ts(rep(0,N)) # y2[1]:=0 defined
for (t in 2:N){ y2[t] <- a + ro*y2[t-1]+v[t] }
# arima.sim simulation
set.seed(1)
y3 <- arima.sim(model=list(ar=0.6), n=1388, mean=0, sd=sqrt(1))
# change n in arima.sim so that it matches the N defined above
c(mean(y1),sd(y1))
c(mean(y2),sd(y2))
c(mean(y3),sd(y3))
N=1388 (and n=1388) gives:
[1] -0.03713488 1.26102077
[1] -0.03713488 1.26102077
[1] -0.01048798 1.28445899
N=1389 (and n=1389) gives:
[1] Inf NaN
[1] -0.03661779 1.26071373
[1] -0.01048798 1.28445899
The standard deviation values are just as expected:
$\mathrm{StdDev}(Y_t) = \sqrt{\mathrm{sd}(v)^2 / (1 - \rho^2)}$
sqrt(sd(v)^2/(1-(0.6)^2)) # 1.284229
Strange things happen with the formula technique:
c(mean(y1),sd(y1)) is defined for N<=1388
c(mean(y1),sd(y1)) is (Inf, NaN) for N=1389 and N=1390
c(mean(y1),sd(y1)) is (NaN, NA) for N>=1391.
Question:
1. Why does the formula technique fail for N >= 1389
(even though we have a stationary series with ρ = 0.6)?
2. Why do we get c(mean(y1),sd(y1)) # (Inf, NaN) for N = 1389 and N = 1390, and
c(mean(y1),sd(y1)) # (NaN, NA) for N >= 1391?
With N <- 1389; ro <- 0.6; a <- 0 the last element of cumsum((a+v[-1]) / ro^(2:N)) is larger than the largest representable number on your PC, i.e. .Machine$double.xmax (I am assuming here it is 1.797693e+308), so it is represented as Inf. This causes the last element of y1 to become Inf as well, since any arithmetic operation involving Inf and a finite number yields Inf. The sd of a vector containing an Inf is NaN:
N <- 1389; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
tail((a+v[-1]) / ro^(2:N))
#[1] -8.183997e+306 1.049540e+307 -1.581977e+306 -3.189985e+307 -7.577105e+307
#[6] Inf
tail(cumsum((a+v[-1]) / ro^(2:N)))
#[1] -1.814422e+307 -7.648815e+306 -9.230793e+306 -4.113064e+307 -1.169017e+308
#[6] Inf
tail(y1)
#[1] -1.6598499 -0.4198324 -0.3039989 -0.8127365 -1.3859780 Inf
With N <- 1390; ro <- 0.6; a <- 0 the last value of (a+v[-1]) / ro^(2:N) is representable again, but the Inf that overflowed one step earlier stays in the cumulative sum, so nothing changes: mean(y1) is still Inf and sd(y1) is still NaN.
N <- 1390; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
tail((a+v[-1]) / ro^(2:N))
#[1] 1.049540e+307 -1.581977e+306 -3.189985e+307 -7.577105e+307 Inf
#[6] -5.546904e+307
tail(cumsum((a+v[-1]) / ro^(2:N)))
#[1] -7.648815e+306 -9.230793e+306 -4.113064e+307 -1.169017e+308 Inf
#[6] Inf
tail(y1)
#[1] -0.4198324 -0.3039989 -0.8127365 -1.3859780 Inf Inf
With N <- 1391; ro <- 0.6; a <- 0 the last value of (a+v[-1]) / ro^(2:N) becomes -Inf (more negative than -.Machine$double.xmax), and the cumsum has to add Inf and -Inf, which gives NaN, i.e. "not a number". The mean of a vector containing NaN is NaN as well, and the sd is NA.
N <- 1391; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
tail((a+v[-1]) / ro^(2:N))
#[1] -1.581977e+306 -3.189985e+307 -7.577105e+307 Inf -5.546904e+307
#[6] -Inf
tail(cumsum((a+v[-1]) / ro^(2:N)))
#[1] -9.230793e+306 -4.113064e+307 -1.169017e+308 Inf Inf
#[6] NaN
tail(y1)
#[1] -0.3039989 -0.8127365 -1.3859780 Inf Inf NaN
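As an aside (my addition, not part of the answer): if you want a vectorised simulation that does not blow up for large N, stats::filter() with method = "recursive" evaluates the recursion y[t] = a + ro*y[t-1] + v[t] directly, so the huge intermediate terms 1/ro^j never appear:
N <- 1391; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N, 0, 1))
# same recursion as the for loop, including y[1] = 0
y4 <- stats::filter(c(0, a + v[-1]), filter = ro, method = "recursive")
c(mean(y4), sd(y4))   # finite, and matches the for-loop result y2 up to rounding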

Mode of density function using optimize

I want to find the mode (x-value) of a univariate density function using R's optimize function.
E.g., for a normal density f(x) ~ N(3, 1) the mode should be the mean, i.e. x = 3.
I tried the following:
# Define the function
g <- function(x) dnorm(x = x, mean = 3, sd = 1)
Dvec <- c(-1000, 1000)
# First get the gradient of the function (grad() is from the numDeriv package)
library(numDeriv)
gradfun <- function(x){ grad(g, x) }
# Find the maximum value
x_mode <- optimize(f=g,interval = Dvec, maximum=TRUE)
x_mode
This gives the incorrect value of the mode as:
$maximum
[1] 999.9999
$objective
[1] 0
This is incorrect, i.e. it returns the upper end of the (-1000, 1000) interval as opposed to x = 3.
Could anyone please help edit the optimisation code? It will be used with more generic functions of x once this simple test case works.
I would use optim for this, avoiding specifying the interval altogether. You can tailor the starting value by taking the maximum of the function over the originally guessed interval:
guessedInterval = min(Dvec):max(Dvec)
superStarSeed = guessedInterval[which.max(g(guessedInterval))]
optim(par=superStarSeed, fn=function(y) -g(y))
#$par
#[1] 3
#$value
#[1] -0.3989423
#$counts
#function gradient
# 24 NA
#$convergence
#[1] 0
#$message
#NULL
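A side note (my addition, not from the answer): optimize() itself also works here once the search interval is narrow enough. On (-1000, 1000), dnorm() underflows to exactly 0 over almost the whole interval, so the golden-section search sees a flat function and drifts to an endpoint; on a tighter interval it finds the mode:
optimize(f = g, interval = c(-10, 10), maximum = TRUE)
# $maximum is approximately 3, $objective approximately dnorm(3, 3, 1) = 0.3989423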
