How to calculate Kullback-Leibler divergence of kernel density estimates in R

I used kernel density estimation to get a nonparametric probability density function. Now I want to compare the tail 'distance' between two kernel density estimates of continuous variables, using the Kullback-Leibler divergence. I have tried the following code:
kl_l <- function(x,y) {
integrand <- function(x,y) {
f.x <- fitted(density(x, bw="nrd0"))
f.y <- fitted(density(y, bw="nrd0"))
return((log(f.x)-log(f.y))*f.x)
}
return(integrate(integrand, lower=-Inf,upper=quantile(density(x, bw="nrd0"),0.25))$value)
# the Kullback-Leibler equation
}
When I run kl_l(a, b), where a and b are 19 continuous variables, it returns an error:
Error in density(y, bw = "nrd0") : argument "y" is missing, with no default
Is there any way to calculate this?
(If anyone wants to see the actual equation: https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2019/attention-to-the-tails-global-financial-conditions-and-exchange-rate-risks.pdf page 13.)

In short, I think you just need to move the f.x and f.y outside the integrand (and possibly replace fitted with approxfun):
kl_l <- function(x, y) {
f.x <- approxfun(density(x, bw = "nrd0"))
f.y <- approxfun(density(y, bw = "nrd0"))
integrand <- function(z) {
return((log(f.x(z)) - log(f.y(z))) * f.x(z))
}
return(integrate(integrand, lower = -Inf, upper = quantile(density(x, bw="nrd0"), 0.25))$value)
# the Kullback-Leibler equation
}
Expanding a little:
Looking at the paper you referenced, it appears as though you need to first create the two fitted distributions f and g. So if your variable a contains observations under the 1-standard-deviation increase in global financial conditions, and b contains the observations under average global financial conditions, you can create two functions as in your example:
f <- approxfun(density(a))
g <- approxfun(density(b))
Then define the integrand:
integrand <- function(x) log(f(x) / g(x)) * f(x)
The upper bound:
upper <- quantile(density(b, bw = "nrd0"), 0.25)
And finally do the integration on x within the specified bounds. Note that each value of x in the numerical computation has to go into both f and g; in your function kl_l, the x and y were separately going into the integrand, which I think is incorrect; and in any case, integrate will only have operated on the first variable.
integrate(integrand, lower = -Inf, upper = upper)$value
One thing to check for is that approxfun returns NA for values outside the range specified in the density, which can mess up your operation, so you'll need to adjust for those (if you expect the density to go to zero, for example).
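To make the NA caveat concrete, here is a minimal sketch rather than a definitive implementation. It evaluates both kernel estimates on a shared grid so that approxfun never has to extrapolate, and it starts the integral at the left edge of that grid instead of -Inf. The samples a and b, the helper name kl_left_tail, the 3-bandwidth padding, and the use of the sample quantile of x as the upper bound are all assumptions made for illustration; they are not taken from the paper.

```r
# Hypothetical samples standing in for the asker's a and b
set.seed(42)
a <- rnorm(200, mean = 0, sd = 1)
b <- rnorm(200, mean = 0.5, sd = 1.2)

kl_left_tail <- function(x, y, p = 0.25) {
  # Shared grid covering both samples plus some room for the kernel tails
  pad <- 3 * max(bw.nrd0(x), bw.nrd0(y))
  lo <- min(x, y) - pad
  hi <- max(x, y) + pad
  dx <- density(x, bw = "nrd0", from = lo, to = hi, n = 1024)
  dy <- density(y, bw = "nrd0", from = lo, to = hi, n = 1024)
  f <- approxfun(dx$x, dx$y)
  g <- approxfun(dy$x, dy$y)

  # Each z goes into both f and g; points where either density is
  # NA or zero contribute nothing instead of producing NaN
  integrand <- function(z) {
    fz <- f(z)
    gz <- g(z)
    ok <- !is.na(fz) & !is.na(gz) & fz > 0 & gz > 0
    out <- numeric(length(z))
    out[ok] <- fz[ok] * (log(fz[ok]) - log(gz[ok]))
    out
  }

  # Assumption: the upper bound is the sample quantile of x
  upper <- unname(quantile(x, p))
  integrate(integrand, lower = lo, upper = upper, subdivisions = 2000L)$value
}

kl_tail <- kl_left_tail(a, b)
```

The finite lower bound is a deliberate swap for -Inf: beyond the grid the interpolated densities are undefined, and the kernel estimate is essentially zero there anyway.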

Related

How to fix code in RMarkdown for simulation that will not run due to "Error: Discrete value supplied to continuous scale"?

I am wondering what is wrong with my following R code (R markdown)? I keep getting an error message for the last line that says "Error in h(x.n, df = N - 2) : unused argument (df = N - 2)". I am very confused because my TA looked at my code and told me that it should run perfectly.
For context, this is the problem I am working on:
library(MASS)
library(tidyverse)
library(hypergeo)
set.seed(1)
rm(list=ls())
N=7
Nsim=10000
rho=0
Sigma=matrix(c(1,rho,rho,1),2,2)
Sigma
mu=c(0,0)
r_vec=matrix(NaN,nrow=1,ncol=Nsim)
#have function mvrnorm-->simulate from multivariate normal distribution. N=7 Correlation matrix sigma. before X was fixed but now is random and formal dependence from Y that I can control. Compute rho hat and see if on average it gives me correct rho. Check how serious bias is when the expected value of rho hat isn't equal to rho. I want a feeling about whether this is something I should worry about or not
for (i in 1:Nsim){
data=mvrnorm(N, mu, Sigma)
r_vec[i]=cor(data[,1],data[,2])
}
mean(r_vec)
update.packages("deSolve")
x.n=seq(-1,1,0.1)
sim_rho0<-function(Nsim,N,rho){
rho=rho
mu=c(0,0)
Sigma=matrix(c(1,rho,rho,1),nrow=2)
r_vec=matrix(NaN,nrow=Nsim)
for (i in 1:Nsim){
data=mvrnorm(N, mu, Sigma)
r_vec[i]=cor(data[,1],data[,2])
}
# here we compute t, which should have a t_{N-2} distribution. This is different here and trying to reconstruct the .Not a mathematical proof. Might be a mistake*****
#range of values and plotting density for each one
h<- function(N,rho,x.n){
rho=rho
a <- ((N-2)*(gamma(N-1))*(1-rho^2)^(N-1)/2*(1-x.n^2)^(N-4)/2)/((2*pi)*(sqrt(N-1/2))((1-x.n*rho)^(N-3/2)))
b <- hypergeo(1/2, 1/2, (2*N-1/2), ((x.n*rho)+1)/2)
h2 = a*b
return(h2)
}
t=r_vec*sqrt(N-2)/(1-r_vec^2)
x.n=seq(-1,1,0.1)
y.n= h(N=10, rho=0.8, x.n=x.n)
df=tibble(X=t)
df2=tibble(x=x.n,y=y.n)
ggplot()+geom_histogram(data=df, aes(x=X,y=..density..),binwidth=0.2,
color="black", fill="white")+ geom_line(data = df2, aes(x = x, y = y),
color = "red")+xlim(-5,5)
}
rho=0.8
Nsim=3000
N=10
sim_rho0(Nsim,N,rho)
You've defined h with the arguments N, rho and x.n. You then call it with an argument df, which h does not have, hence the error. Call h with the correct arguments: don't leave out N and rho, and if the value x.n should go to the argument x.n, name it explicitly rather than passing it positionally. I also recommend following a style guide, e.g. https://style.tidyverse.org/
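A small sketch of the difference, using a simplified stand-in for h (the body below is hypothetical and only the argument names match the question):

```r
# Hypothetical stand-in for the asker's h(); only the argument names matter here
h <- function(N, rho, x.n) {
  (1 - rho^2) * (1 - x.n^2)^((N - 4) / 2)
}
x.n <- seq(-1, 1, 0.1)

# Calling with an argument h() does not define reproduces the error message
msg <- tryCatch(h(x.n, df = 8), error = function(e) conditionMessage(e))

# Naming every argument removes any ambiguity about what goes where
y.n <- h(N = 10, rho = 0.8, x.n = x.n)
```

Here msg holds the "unused argument" text, while the named call returns one value per element of x.n.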

Trying to plot loglikelihood of Cauchy distribution for different values of theta in R

I am trying to plot the log-likelihood function of the Cauchy distribution for varying values of theta (location parameter). These are my observations:
obs<-c(1.77,-0.23,2.76,3.80,3.47,56.75,-1.34,4.24,3.29,3.71,-2.40,4.53,-0.07,-1.05,-13.87,-2.53,-1.74,0.27,43.21)
Here is my log-likelihood function:
ll_c<-function(theta,x_values){
n<-length(x_values)
logl<- -n*log(pi)-sum(log(1+(x_values-theta)^2))
return(logl)
}
and I've tried making a plot using this code:
x<-seq(from=-10,to=10,by=0.1);length(x)
theta_null<-NULL
for (i in x){
theta_log<-ll_c(i,counts)
theta_null<-c(theta_null,theta_log)
}
plot(theta_null)
The graph does not look right and for some reason the length of x and theta_null differs.
I am assuming that theta is your location parameter (the scale is set to 1 in my example). You should obtain the same result using a t-distribution with 1 df and shifting the observations by theta. I left some comments in the code as guidance.
obs = c(1.77,-0.23,2.76,3.80,3.47,56.75,-1.34,4.24,3.29,3.71,-2.40,4.53,-0.07,-1.05,-13.87,-2.53,-1.74,0.27,43.21)
ll_c=function(theta, obs)
{
# Compute log-lik for obs and a value of theta (location)
logl= sum(dcauchy(obs, location = theta, scale = 1, log = TRUE))
return(logl)
}
# Loop for possible values of theta(obs given)
x = seq(from=-10,to=10,by=0.1)
ll = NULL
for (i in x)
{
ll = c(ll, ll_c(i, obs))
}
# Plot log-lik vs possible value of theta
plot(x, ll)
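As a quick cross-check, the closed-form log-likelihood from the question and the dcauchy version should agree on the whole grid once the function is given obs rather than counts. This is only a sketch; the names ll_closed, ll_dcauchy and theta_grid are made up for the comparison.

```r
obs <- c(1.77, -0.23, 2.76, 3.80, 3.47, 56.75, -1.34, 4.24, 3.29, 3.71,
         -2.40, 4.53, -0.07, -1.05, -13.87, -2.53, -1.74, 0.27, 43.21)

# Closed form from the question
ll_closed <- function(theta, x) {
  -length(x) * log(pi) - sum(log(1 + (x - theta)^2))
}

# Same quantity via the built-in Cauchy density
ll_dcauchy <- function(theta, x) {
  sum(dcauchy(x, location = theta, scale = 1, log = TRUE))
}

theta_grid <- seq(from = -10, to = 10, by = 0.1)
ll1 <- sapply(theta_grid, ll_closed, x = obs)
ll2 <- sapply(theta_grid, ll_dcauchy, x = obs)

same <- isTRUE(all.equal(ll1, ll2))
```

Because sapply returns one value per grid point, theta_grid and ll1 have the same length, so plot(theta_grid, ll1, type = "l") puts theta on the horizontal axis.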
It is hard to say exactly what you are experiencing without more info. But I'll make an educated guess.
First of all, we can simplify this a lot by using the *t family of functions for the t distribution, as the Cauchy distribution is just the t distribution with df = 1. So your calculations could have been done using
for(i in ncp)
theta_null <- c(theta_null, sum(dt(values, 1, i, log = TRUE)))
Note that the additive constant -n*log(pi) doesn't matter for any practical purposes. We are usually interested in maximizing (or minimizing) the likelihood, in which case constants are irrelevant.
Now if we use this approach, we can quite quickly notice something by printing the values:
print(head(theta_null))
[1] -Inf -Inf -Inf -Inf -Inf -Inf
So I am assuming what you are experiencing is that many of your values are "almost" negative infinity, and maybe these are not stored correctly in your outcome vector. I can't see that this should be the case from your code, but this would be my initial guess.
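One caveat worth checking before relying on the dt approach: the third argument of dt is a non-centrality parameter, which is not the same thing as a location shift. The sketch below compares the two on a few hypothetical points (x and theta are made-up values); shifting the observations and using the central t with df = 1 matches dcauchy, while passing theta as ncp does not.

```r
x <- c(-2, 0, 0.5, 3)   # hypothetical evaluation points
theta <- 1              # hypothetical location

shifted_t  <- dt(x - theta, df = 1)         # central t, observations shifted by theta
cauchy     <- dcauchy(x, location = theta)  # Cauchy with location theta
noncentral <- dt(x, df = 1, ncp = theta)    # non-central t, not a location shift

shift_matches <- isTRUE(all.equal(shifted_t, cauchy))
ncp_matches   <- isTRUE(all.equal(noncentral, cauchy))
```

So if the goal is the Cauchy log-likelihood in the location theta, sum(dt(values - theta, df = 1, log = TRUE)) is the form to use.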

Nested integration for incomplete convolution of gauss densities

Let g(x) = 1/sqrt(2*pi) * exp(-x^2 / 2) be the density of the normal distribution with mean 0 and standard deviation 1. In some calculation on paper, integrals appeared of the form
where c>0 is a positive number.
Since I could not evaluate this by hand, I had the idea to approximate and plot it. I tried this in R, because R provides the dnorm function and a function to do integrals.
As you can see, I need to integrate numerically n times, where n is chosen when calling a plot function. My code has a for-loop to create those "incomplete" convolutions iteratively.
For example even with n=3 and c=1 this gives me an error. n=2 (thus it's one integration) works.
N = 3
ngauss <- function(x) dnorm(x , mean = 0, sd = 1)
convoluts <- list()
convoluts[[1]] <- ngauss
for (i in 2:N) {
h <- function(y) {
g <- function(z) {ngauss(y-z)*convoluts[[i-1]](z)}
return(integrate(g, lower = -1, upper = 1)$value)
}
h <- Vectorize(h)
convoluts[[i]] <- h
}
convoluts[[3]](0)
What I get is:
Error: evaluation nested too deeply: infinite recursion /
options(expressions=)?
I understand that this is a hard computation, but for "small" n something similar should be possible.
Maybe someone can help me fix my code or recommend a better way to implement this. Another language that is more appropriate for this would also be okay.
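The issue is how the closures look up i, rather than integrate itself. Each h refers to the variable i in the enclosing environment, not to the value i had when that h was created. By the time convoluts[[3]](0) is called, the loop has finished and i equals N in every closure, so convoluts[[i - 1]] keeps calling the same function and the recursion never ends. Fixing the value of i at creation time, using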
h <- evalq(function(y) {
g <- function(z) {ngauss(y - z) * convoluts[[i - 1]](z)}
integrate(g, lower = -1, upper = 1)$value
}, list(i = i))
does the job and, say, setting N <- 6 quickly gives
convoluts[[N]](0)
# [1] 0.03423872
As your integration is simply the pdf of a sum of N independent standard normals (which then follows N(0, N)), we may also verify this approach by setting lower = -Inf and upper = Inf. Then with N <- 4 we have
dnorm(0, sd = sqrt(N))
# [1] 0.1994711
convoluts[[N]](0)
# [1] 0.1994711
So, for practical purposes, when c = Inf, you are way better off using dnorm rather than manual computations.
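An alternative to evalq, sketched here under the same setup, is a small factory function that receives the previous convolution as an argument and forces it immediately. The helper names make_step and nth_convolution are hypothetical; the check against dnorm mirrors the verification above, with N = 3 to keep the nested integration cheap.

```r
ngauss <- function(x) dnorm(x, mean = 0, sd = 1)

# Hypothetical helper: builds one convolution step from the previous function
make_step <- function(prev, lower, upper) {
  force(prev)  # evaluate the argument now, so later loop iterations cannot change it
  h <- function(y) {
    g <- function(z) ngauss(y - z) * prev(z)
    integrate(g, lower = lower, upper = upper)$value
  }
  Vectorize(h)
}

# Hypothetical helper: the N-th "incomplete" convolution on [lower, upper]
nth_convolution <- function(N, lower, upper) {
  conv <- ngauss
  if (N >= 2) for (i in 2:N) conv <- make_step(conv, lower, upper)
  conv
}

N <- 3
full      <- nth_convolution(N, -Inf, Inf)(0)  # complete convolution
reference <- dnorm(0, sd = sqrt(N))            # density of N(0, N) at 0
truncated <- nth_convolution(N, -1, 1)(0)      # the c = 1 case from the question
```

Since each step holds its own prev, no closure depends on the loop variable, which is the same effect the evalq call achieves.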

Error in optim(): searching for global minimum for a univariate function

I am trying to optimize a function in R.
The function is the likelihood function of the negative binomial when estimating only the mu parameter. This should not be a problem since the function clearly has just one maximum. But I am not able to reach the desired result.
The function to be optimized is:
EMV <- function(data, par) {
Mi <- par
Phi <- 2
N <- NROW(data)
Resultado <- log(Mi/(Mi + Phi))*sum(data) + N*Phi*log(Phi/(Mi + Phi))
return(Resultado)
}
data is a vector of negative binomial draws with mu = 2 and theta = 2 (rnegbin is from the MASS package):
data <- rnegbin(10000, mu = 2, theta = 2)
When I plot the function having mu as variable with the following code:
x <- seq(0.1, 100, 0.02)
z <- EMV(data,0.1)
for (aux in x) {z <- rbind(z, EMV(data,aux))}
z <- z[2:NROW(z)]
plot(x,z)
I get the following curve:
And the maximum value of z is close to the parameter value, 2:
x[which.max(z)]
But the optimization does not work with BFGS:
Error in optim(par = theta, fn = EMV, data = data, method = "BFGS") :
non-finite finite-difference value [1]
And it does not reach the right value using SANN, for example:
$par
[1] 5.19767e-05
$value
[1] -211981.8
$counts
function gradient
10000 NA
$convergence
[1] 0
$message
NULL
The questions are:
What am I doing wrong?
Is there a way to tell optim that the param should be bigger than 0?
Is there a way to tell optim that I want to maximize the function? (I am afraid the optim is trying to minimize and is going to a very small value where function returns smallest values)
Minimization or Maximization?
?optim mentions maximization only in parentheses, so minimization is the default:
fn: A function to be minimized (or maximized) ...
Thus, to maximize an objective function, we multiply it by -1 and then minimize it. This is quite a common situation. In statistics we often want to find the maximum log likelihood, so with optim() the usual approach is to minimize the negative log likelihood.
Which method to use?
For 1D minimization, we should use method "Brent". This method lets us specify a lower bound and an upper bound for the search region, and the search stays inside that interval. Such a specification helps you constrain your parameters. For example, if you don't want mu to be smaller than 0, just set lower = 0.
When we move to 2D or higher dimensions, we should resort to "BFGS". In this case, if we want to constrain one of our parameters, say a, to be positive, we take the log transform log_a = log(a) and reparameterize the objective function in terms of log_a. Now log_a is unconstrained. The same goes when we want to constrain multiple parameters to be positive.
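The log reparameterization can be sketched on the one-parameter problem from the question, even though BFGS is not needed there. The assumptions are labeled in the code: rnbinom from base R replaces rnegbin, and the objective is divided by the sample size so that the first BFGS step stays in a sensible range.

```r
set.seed(1)
# Assumption: base-R rnbinom(size, mu) as a stand-in for MASS::rnegbin(mu, theta)
data <- rnbinom(10000, size = 2, mu = 2)

# Mean negative log-likelihood in mu (constants dropped), with phi fixed at 2
nll <- function(mu, data, phi = 2) {
  -(log(mu / (mu + phi)) * mean(data) + phi * log(phi / (mu + phi)))
}

# Reparameterize: optimize over log_mu, which is unconstrained
nll_log <- function(log_mu, data) nll(exp(log_mu), data)

fit <- optim(par = log(1), fn = nll_log, data = data, method = "BFGS")
mu_hat <- exp(fit$par)  # transform back to the original scale
```

With phi known, the maximum likelihood estimate of mu is the sample mean, so mu_hat can be compared against mean(data).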
How to change your code?
EMV <- function(data, par) {
Mi <- par
Phi <- 2
N <- NROW(data)
Resultado <- log(Mi/(Mi + Phi))*sum(data) + N*Phi*log(Phi/(Mi + Phi))
return(-1 * Resultado)
}
optim(par = theta, fn = EMV, data = data, method = "Brent", lower = 0, upper = 1E5)
The help file for optim says: "By default optim performs minimization, but it will maximize if control$fnscale is negative." So if you either multiply your function output by -1 or change the control object input, you should get the right answer.
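Both routes can be tried side by side. The following is a sketch with simulated data (rnbinom from base R in place of rnegbin, and an upper bound of 100 chosen for illustration); the function name loglik is made up here and corresponds to the original, un-negated EMV.

```r
set.seed(1)
# Assumption: base-R rnbinom(size, mu) as a stand-in for MASS::rnegbin(mu, theta)
data <- rnbinom(10000, size = 2, mu = 2)

# Log-likelihood in mu with phi fixed at 2 (constants dropped)
loglik <- function(par, data) {
  phi <- 2
  log(par / (par + phi)) * sum(data) + length(data) * phi * log(phi / (par + phi))
}

# Option 1: minimize the negative log-likelihood
fit_min <- optim(par = 1, fn = function(par, data) -loglik(par, data),
                 data = data, method = "Brent", lower = 0, upper = 100)

# Option 2: keep the log-likelihood and let optim maximize it via fnscale
fit_max <- optim(par = 1, fn = loglik, data = data, method = "Brent",
                 lower = 0, upper = 100, control = list(fnscale = -1))
```

The two fits should land on the same mu, and with phi known that value is the sample mean of the data.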

Fitting an inverse function

I have a function which looks like:
g(x) = f(x) - a^b / f(x)^b
g(x) - known function, data vector provided.
f(x) - hidden process.
a,b - parameters of this function.
From the above we get the relation:
f(x) = inverse(g(x))
My goal is to optimize parameters a and b such that f(x) is as close as possible to a normal distribution. If we look at a normal Q-Q plot of f(x) (attached), my purpose is to minimize the distance between f(x) and the straight line that represents the normal distribution, by optimizing parameters a and b.
I wrote the below code:
g_fun <- function(x) {x - a^b/x^b}
inverse = function (f, lower = 0, upper = 2000) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
f_func = inverse(function(x) g_fun(x))
# let's make up an example
# g(x) values are known
g <- c(-0.016339, 0.029646, -0.0255258, 0.003352, -0.053258, -0.018971, 0.005172,
0.067114, 0.026415, 0.051062)
# Calculate f(x) by using the inverse of g(x), when a=a0 and b=b0
for (i in 1:10) {
f[i] <- f_fun(g[i])
}
I have two questions:
How to pass parameters a and b to the functions?
How to perform this optimization task, meaning find a and b such that f(x) would approximate normal distribution.
Not sure how you were able to produce the Q-Q plot since your provided examples do not work. You are not specifying the values of a and b and you are defining f_func but calling f_fun. Anyway here is my answer to your questions:
How to pass parameters a and b to the functions? - Just pass them as arguments to the functions.
How to perform this optimization task, meaning find a and b such that f(x) would approximate normal distribution? - The same way any optimization task is done. Define a cost function, then minimize it.
Here is the revised code: I have added a and b as parameters, removed the inverse function and incorporated it inside f_func, which can now take vector input so no need for a for loop.
g_fun <- function(x,a,b) {x - a^b/x^b}
f_func = function(y,a,b,lower = 0, upper = 2000){
sapply(y,function(z) { uniroot(function(x) g_fun(x,a,b) - z, lower = lower, upper = upper)$root})
}
# g(x) values are known
g <- c(-0.016339, 0.029646, -0.0255258, 0.003352, -0.053258, -0.018971, 0.005172,
0.067114, 0.026415, 0.051062)
f <- f_func(g,1,1) # using a = 1 and b = 1
#[1] 0.9918427 1.0149329 0.9873386 1.0016774 0.9737270 0.9905320 1.0025893
#[8] 1.0341199 1.0132947 1.0258569
f_func(g,2,10)
[1] 1.876408 1.880554 1.875578 1.878138 1.873094 1.876170 1.878304 1.884049
[9] 1.880256 1.882544
Now for the optimization part, it depends on what you mean by "f(x) would approximate a normal distribution". You can compare the mean squared error from the Q-Q line if you want. Also, since you say approximate, how close is good enough? You can go with shapiro.test and keep searching until you find a p-value above 0.05, since a higher p-value means less evidence against normality (beware that there may not be a solution).
shapiro.test(f_func(g,1,2))$p
[1] 0.9484821
cost <- function(x,y) -shapiro.test(f_func(g,x,y))$p
Now that we have a cost function (the negative p-value, so that minimizing it favors normality), how do we go about minimizing it? There are many different ways to do numerical optimization. Take a look at the optim function: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optim.html.
optim(c(1,1),cost)
This final line does not work, but without proper data and context this is as far as I can go. Hope this helps.
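One likely reason the call fails is that optim passes all parameters to the cost function as a single vector, not as separate arguments. The following is a minimal sketch under several assumptions, not a definitive solution: the cost is the negative Shapiro-Wilk p-value, parameter values for which the inversion or the test fails get a fixed penalty, the lower bound is nudged to 1e-8 to avoid dividing by zero, and the uniroot tolerance is tightened so the cost changes smoothly with a and b.

```r
g_fun <- function(x, a, b) x - a^b / x^b

# Inverse of g_fun by root finding; lower = 1e-8 and tol = 1e-10 are assumptions
f_func <- function(y, a, b, lower = 1e-8, upper = 2000) {
  sapply(y, function(z) {
    uniroot(function(x) g_fun(x, a, b) - z,
            lower = lower, upper = upper, tol = 1e-10)$root
  })
}

# g(x) values are known
g <- c(-0.016339, 0.029646, -0.0255258, 0.003352, -0.053258, -0.018971, 0.005172,
       0.067114, 0.026415, 0.051062)

# optim hands over one vector: par[1] is a, par[2] is b
cost <- function(par) {
  tryCatch(
    -shapiro.test(f_func(g, par[1], par[2]))$p.value,
    error = function(e) 1  # penalty: worse than any valid value in [-1, 0]
  )
}

start <- c(1, 1)
fit <- optim(start, cost)  # Nelder-Mead by default
```

fit$par then holds the candidate a and b, and -fit$value is the corresponding p-value. With only ten observations the test has little power, so many parameter pairs may look equally acceptable.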
