I need to demonstrate that the probability density function is the derivative of the CDF. Any distribution will do, but I have been trying with the normal. I have got as far as:
set.seed(53)
b <- rnorm(500)
db <- density(b)
plot(db)
Then I can calculate the cumulative probabilities using pnorm(b), but then I don't know how to differentiate, because D() requires an expression rather than pnorm(). Could anyone help, please?
Here's the console transcript where I demonstrated the near-equality (to 5 or 7 decimal places) of the integral of dnorm from -Inf up to selected values of x with pnorm at those same values. The Fundamental Theorem of Calculus says that if the integral of a function f(x) is g(x), then f(x) is the derivative of g(x). (Or words to that effect.)
> sapply(c(0,Inf), function(x) integrate(dnorm, lower=-Inf, upper=x))
             [,1]         [,2]
value        0.5          1
abs.error    4.680562e-05 9.361124e-05
subdivisions 3            3
message      "OK"         "OK"
call         expression   expression
> sapply(c(0,Inf), function(x) integrate(dnorm, lower=-Inf, upper=x)$value)
[1] 0.5 1.0
> sapply(seq(-3,3, by=0.5), function(x) integrate(dnorm, lower=-Inf, upper=x)$value)
[1] 0.001349899 0.006209665 0.022750132 0.066807201 0.158655254 0.308537539
[7] 0.500000000 0.691462461 0.841344751 0.933192799 0.977249868 0.993790335
[13] 0.998650102
> pnorm(seq(-3, 3, by=0.5))
[1] 0.001349898 0.006209665 0.022750132 0.066807201 0.158655254 0.308537539
[7] 0.500000000 0.691462461 0.841344746 0.933192799 0.977249868 0.993790335
[13] 0.998650102
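As a purely numerical cross-check (a small sketch of my own, not part of the original exchange): a central finite difference of pnorm() should reproduce dnorm() to high accuracy at arbitrary points, with no symbolic machinery at all:
h <- 1e-6
xs <- seq(-3, 3, by = 0.5)
num_deriv <- (pnorm(xs + h) - pnorm(xs - h)) / (2 * h)  # central difference of the CDF
max(abs(num_deriv - dnorm(xs)))                         # tiny, on the order of 1e-10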
I wasn't sure that D() was "smart" enough to do the symbolic differentiation, but I shouldn't have been so skeptical. This bit of console interaction was done following the examples on the ?deriv help page:
> D(quote(pnorm(x)), "x")
dnorm(x)
Also ... here's something you can get with deriv:
> norm.expr <- expression(pnorm(x))
> deriv(norm.expr, "x")
expression({
    .value <- pnorm(x)
    .grad <- array(0, c(length(.value), 1L), list(NULL, c("x")))
    .grad[, "x"] <- dnorm(x)
    attr(.value, "gradient") <- .grad
    .value
})
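And here is a minimal usage sketch (my addition) showing how that generated expression can actually be evaluated: define x in the workspace and eval() the result; the "gradient" attribute then holds dnorm(x):
dn <- deriv(norm.expr, "x")
x <- c(-1, 0, 1)
eval(dn)   # value is pnorm(x); attr(, "gradient") is dnorm(x)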
My predictor (x) has a U-shaped distribution in relation to the binary outcome (y), with positive outcomes at both low and high values of x, leading to a biconcave ROC curve with a poor area under the curve (AUC).
To maximise its ability to discriminate the outcome, I am trying to optimise the parameters of a second-degree polynomial of x, using optim with 1 - AUC as the cost function to minimise.
x = c(13,7,7,7,1,100,3,4,4,2,2,7,14,8,3,14,5,12,8,
13,9,4,9,4,8,3,13,9,4,4,5,9,10,10,7,6,12,7,2,
6,6,4,3,2,3,10,5,2,5,8,3,5,4,2,7,5,7,6,79,9)
y = c(0,0,1,0,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0)
theta = c(0, 0, 0)
library(pROC)  # roc() and its $auc element are assumed to come from the pROC package
min_auc <- function(theta, x, y) {
  (1 - roc(y, (theta[1] + theta[2]*x + theta[3]*x^2))$auc)
}
optim(theta, min_auc, x = x, y = y)
The results are as follows:
$par
[1] 0.0 0.1 0.0
$value
[1] 0.4380054
$counts
function gradient
8 NA
$convergence
[1] 0
$message
NULL
However, by setting the parameters manually, I know that min_auc can be reduced further.
theta = c(0, -40, 1)
(1 - roc(y, (theta[1] + theta[2]*x + theta[3]*(x^2)))$auc)
[1] 0.2762803
Could anyone explain to me what I am doing wrong, please? Is it possibly due to a non-convex cost function?
One possibility is that it's a collinearity problem. Scaling the inputs helps:
min_auc <- function(theta, x, y) {
(1 - roc(y, (theta[1] + theta[2]*scale(x) + theta[3]*scale(x)^2))$auc)
}
optim(theta, min_auc, x = x, y = y)
# $par
# [1] -0.02469136 -0.03117284 0.11049383
#
# $value
# [1] 0.2762803
#
# $counts
# function gradient
# 30 NA
#
# $convergence
# [1] 0
#
# $message
# NULL
#
Another potential problem is that the surface you're optimizing over has some flat spots. Let's say, for example, that we fix the intercept in this equation at -2, which is about what you get from qlogis(mean(y)). Then you're only optimizing over 2 parameters, so the surface is easier to see. Here's what it looks like with the two remaining theta terms on the two horizontal axes and the 1 - AUC value on the vertical axis.
min_auc <- function(theta, x, y) {
(1 - roc(y, (-2 + theta[1]*scale(x) + theta[2]*scale(x)^2))$auc)
}
s <- seq(-.25, .25, length=50)
o <- outer(s, s, Vectorize(function(z,w)min_auc(c(z,w),x,y)))
library(plotly)
plot_ly(x = ~s, y = ~s, z = ~o) %>% add_surface()
As you may have noticed above, there is no unique solution to the problem. There are lots of solutions that seem to get to the minimum value.
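To make that concrete, here is a small illustration (my addition, assuming roc() is pROC::roc() as used in the question): the AUC depends only on the ordering of the scores, so any positive rescaling of theta produces exactly the same 1 - AUC, which is one reason whole regions of the surface sit at the minimum.
library(pROC)
theta <- c(0, -40, 1)
sapply(c(0.5, 1, 2, 10), function(k)
  1 - roc(y, k * (theta[1] + theta[2]*x + theta[3]*x^2))$auc)
# all four values are identical (about 0.276 for this data)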
I want to build a function that takes E[X] and Var[X] of a univariate lognormal variable and gives me back the parameters mu and theta, where
E[X] = exp(mu + theta)
Var[X] = exp(2*mu + theta)*(exp(theta) - 1)
That is, the function would take E[X] and Var[X] as input and would give me theta and mu as output.
There are several packages that provide ways and means to solve a system of nonlinear equations. One of these is nleqslv.
You need to provide a function that returns the differences between the actual values of the equations and the desired values.
Load package nleqslv and define the following function:
library(nleqslv)
f <- function(x,Ex,Varx) {
y<- numeric(length(x))
mu <- x[1]
theta <- x[2]
y[1] <- exp(mu+theta) - Ex
y[2] <- exp(2*mu+theta)*(exp(theta)-1) - Varx
y
}
The vector x in the function contains the values of mu and theta.
An example with Ex=2 and Varx=3 and some random starting values
xstart <- c(1,1)
nleqslv(xstart,f,Ex=2,Varx=3)
gives the following
$x
[1] -0.6931472 1.3862944
$fvec
[1] -8.095125e-11 -8.111645e-11
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 31
$njcnt
[1] 2
$iter
[1] 22
See the manual of nleqslv for the meaning of the different elements of its return value.
If you want to investigate the effect of the different solving methods, try this:
testnslv(xstart,f,Ex=2,Varx=3)
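As a cross-check (my addition, derived directly from the two equations above): dividing Var[X] by E[X]^2 eliminates mu, since Var[X]/E[X]^2 = 1 - exp(-theta), so this particular system can also be solved in closed form:
Ex <- 2; Varx <- 3
theta <- -log(1 - Varx/Ex^2)   # 1.386294, matching the nleqslv result
mu    <- log(Ex) - theta       # -0.6931472, matching the nleqslv result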
I tried to simulate the AR(1) model Y_t = 0.6*Y_{t-1} + V_t (with V_t ~ N(0,1)) with three different techniques:
1. The closed-form expression Y_t = ρ^t * [Y_1/ρ + Σ_{j=2..t} (α + V_j)/ρ^j], obtained from Y_t = α + ρ*Y_{t-1} + V_t by back-substitution and evaluated with a cumulative sum.
2. The classical for loop.
3. arima.sim simulation.
While I expected to get more or less the same results, something interesting and unexpected happened:
N <- 1388; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
# The classical "for" loop
y2 <- ts(rep(0,N)) # y2[1]:=0 defined
for (t in 2:N){ y2[t] <- a + ro*y2[t-1]+v[t] }
# arima.sim simulation
set.seed(1)
y3 <- arima.sim(model=list(ar=0.6), n=1388, mean=0, sd=sqrt(1))
# change n in arima.sim so that n = N whenever N is changed above
c(mean(y1),sd(y1))
c(mean(y2),sd(y2))
c(mean(y3),sd(y3))
N=1388 (and n=1388) gives:
[1] -0.03713488 1.26102077
[1] -0.03713488 1.26102077
[1] -0.01048798 1.28445899
N=1389 (and n=1389) gives:
[1] Inf NaN
[1] -0.03661779 1.26071373
[1] -0.01048798 1.28445899
The standard deviation values are just as expected:
StdDev(Y_t) = sqrt[ sd(V_t)^2 / (1 - ρ^2) ]
sqrt(sd(v)^2/(1-(0.6)^2)) # 1.284229
Strange things happen with the formula technique:
c(mean(y1),sd(y1)) is defined for N<=1388
c(mean(y1),sd(y1)) is (Inf, NaN) for N=1389 and N=1390
c(mean(y1),sd(y1)) is (NaN, NA) for N>=1391.
Questions:
1. Why does the formula technique fail for N >= 1389
(even though we have a stationary series with ρ=0.6)?
2. Why do we get c(mean(y1),sd(y1)) # (Inf, NaN) for N=1389 and N=1390, and
c(mean(y1),sd(y1)) # (NaN, NA) for N>=1391?
With N <- 1389; ro <- 0.6; a <- 0, the last element of cumsum((a+v[-1]) / ro^(2:N)) is larger than the largest representable double on your PC, i.e. .Machine$double.xmax (I am assuming here that this is 1.797693e+308). It is therefore represented as Inf. This causes the last element of y1 to become Inf as well, since any legal arithmetic operation involving Inf and a finite number yields Inf, and sd() of a vector containing an Inf is NaN:
N <- 1389; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
tail((a+v[-1]) / ro^(2:N))
#[1] -8.183997e+306 1.049540e+307 -1.581977e+306 -3.189985e+307 -7.577105e+307
#[6] Inf
tail(cumsum((a+v[-1]) / ro^(2:N)))
#[1] -1.814422e+307 -7.648815e+306 -9.230793e+306 -4.113064e+307 -1.169017e+308
#[6] Inf
tail(y1)
#[1] -1.6598499 -0.4198324 -0.3039989 -0.8127365 -1.3859780 Inf
With N <- 1390; ro <- 0.6; a <- 0, the last value of (a+v[-1]) / ro^(2:N) is representable again, but the element that overflowed before is still Inf, so essentially nothing changes:
N <- 1390; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
tail((a+v[-1]) / ro^(2:N))
#[1] 1.049540e+307 -1.581977e+306 -3.189985e+307 -7.577105e+307 Inf
#[6] -5.546904e+307
tail(cumsum((a+v[-1]) / ro^(2:N)))
#[1] -7.648815e+306 -9.230793e+306 -4.113064e+307 -1.169017e+308 Inf
#[6] Inf
tail(y1)
#[1] -0.4198324 -0.3039989 -0.8127365 -1.3859780 Inf Inf
With N <- 1391; ro <- 0.6; a <- 0, the last value of (a+v[-1]) / ro^(2:N) becomes -Inf (more negative than -.Machine$double.xmax), and cumsum then has to add Inf and -Inf, which gives NaN ("not a number"). The mean of a vector containing NaN is NaN as well, and sd() of such a vector is NA:
N <- 1391; ro <- 0.6; a <- 0
set.seed(1)
v <- ts(rnorm(N,0,1))
# Formula technique
y1 <- ts(rep(0,N)) # y[1]:=0 defined
y1[-1] <- ro^(2:N) * (y1[1]/ro + cumsum((a+v[-1]) / ro^(2:N)))
tail((a+v[-1]) / ro^(2:N))
#[1] -1.581977e+306 -3.189985e+307 -7.577105e+307 Inf -5.546904e+307
#[6] -Inf
tail(cumsum((a+v[-1]) / ro^(2:N)))
#[1] -9.230793e+306 -4.113064e+307 -1.169017e+308 Inf Inf
#[6] NaN
tail(y1)
#[1] -0.3039989 -0.8127365 -1.3859780 Inf Inf NaN
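For what it's worth, here is a small sketch (my addition) of a rewrite that avoids the huge ro^(-j) terms altogether: stats::filter() with method = "recursive" applies the same recursion as the for loop, so it stays finite for any N. The first element is pinned to 0 to mirror y2[1] := 0.
y4 <- ts(c(0, stats::filter((a + v)[-1], filter = ro, method = "recursive")))
y2 <- ts(rep(0, N)); for (t in 2:N) y2[t] <- a + ro*y2[t-1] + v[t]
all.equal(as.numeric(y4), as.numeric(y2))   # TRUE, with no Inf/NaN even for large N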
I want to find the mode (x-value) of a univariate density function using R's optimize function.
I.e., for a normal density f(x) ~ N(3, 1), the mode should be the mean, i.e. x = 3.
I tried the following:
# Define the function
g <- function(x) dnorm(x = x, mean = 3, sd = 1)
Dvec <- c(-1000, 1000)
# First get the gradient of the function (grad() is from the numDeriv package; gradfun is not used below)
library(numDeriv)
gradfun <- function(x){ grad(g, x) }
# Find the maximum value
x_mode <- optimize(f=g,interval = Dvec, maximum=TRUE)
x_mode
This gives an incorrect value for the mode:
$maximum
[1] 999.9999
$objective
[1] 0
This is incorrect, i.e. it returns the upper end of the (-1000, 1000) interval rather than x = 3.
Could anyone please help me fix the optimisation code?
It will be used with more generic functions of x once this simple test case works.
I would use optim for this, without specifying an interval. You can pick a good starting value by taking the point where the function is largest on the originally guessed interval:
guessedInterval = min(Dvec):max(Dvec)                           # coarse grid over the original interval
superStarSeed = guessedInterval[which.max(g(guessedInterval))]  # grid point with the highest density
optim(par=superStarSeed, fn=function(y) -g(y))                  # minimise -g, i.e. maximise g
#$par
#[1] 3
#$value
#[1] -0.3989423
#$counts
#function gradient
# 24 NA
#$convergence
#[1] 0
#$message
#NULL
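As an alternative sketch (my addition): optimize() itself also finds the mode once the search interval is narrow enough that its first golden-section probes land where the density is not numerically zero:
optimize(f = g, interval = c(-10, 10), maximum = TRUE)
# $maximum is approximately 3; $objective is approximately dnorm(3, 3, 1) = 0.3989423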