I do not know exactly why the CDF has different values when I change the bandwidth in the kernel density estimation. In the code below, I generate random numbers from a Gaussian distribution and estimate the kernel density of the data for different bandwidths (h). When I integrate the pdf I get values more or less distant from 1, so the CDF seems to depend on the bandwidth used in the kernel density estimation. Yet I believed that the factor 1/nh in kernel density estimation ensures that the pdf integrates to 1. If the bandwidth is the issue, how can I choose it so that the resulting pdf integrates to 1?
g<-1
n<-1000
set.seed(g)
df <- data.frame(x=sort(rnorm(n,0,1)))
library(functional)
gaussianKernel <- function(u) exp(-u^2/2)/(2*pi)^.5
densityFunction <- function(x, df, ker, h){
  difference <- t(t(df) - x) / h
  sum(apply(difference, 1, ker)) / (nrow(df) * h)
}
myDensityFunction <- Curry(densityFunction, df=df, ker=gaussianKernel, h=2)
vect <- vector()
for (i in 1:length(df$x)){
  f <- myDensityFunction(df$x[i])
  vect <- c(vect, f)
}
f <- approxfun(df$x, vect, yleft = 0, yright = 0)
integrate(f, -Inf, Inf)
You are right about the effect of bandwidth: as the bandwidth increases, the bias of the estimator increases while the variance decreases. There are some rules of thumb for calculating the bandwidth. I give one below, and you can see that with a proper bandwidth the estimated density curve approaches the exact normal. Vectorizing your functions will also give you a speedup:
kernel <- function(u) exp(-u^2/2) / (2*pi)^.5
dens <- Vectorize(function(x, df, ker, h) {
  1/(h*nrow(df)) * sum(ker((df - x)/h))
}, vectorize.args = "x")
b <- 1.06*(min(1, IQR(df$x)/1.34))*n^(-1/5) # Silverman's rule-of-thumb bandwidth; the 1 stands in for sd(x), which is about 1 here
vect <- dens(df$x, df=df, kernel, b)
f <- approxfun(df$x, vect, yleft=0, yright=0)
integrate(f, -Inf, Inf)
# 0.9991092 with absolute error < 0.00012
nvals <- dnorm(df$x)
plot(df$x, nvals)
points(df$x, vect, col="azure3")
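As a quick illustration (a sketch reusing dens, kernel, and df from above): the Gaussian KDE itself integrates to exactly 1 for any h, so the values below drift from 1 mainly because the interpolant is set to 0 outside the sample range, and a larger h pushes more mass beyond that range.
# Sketch: integral of the interpolated KDE for several bandwidths
for (h in c(0.1, 0.5, 1, 2)) {
  v  <- dens(df$x, df = df, kernel, h)
  fh <- approxfun(df$x, v, yleft = 0, yright = 0)
  cat("h =", h, " integral =", integrate(fh, -Inf, Inf)$value, "\n")
}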
Related
I would like to run a Monte Carlo simulation in R to estimate theta. Could someone please recommend some resources and suggestions for how I could do this?
I have started by creating a sample from the gamma distribution, using its shape and rate, but I am unsure where to go next with this.
x = seq(0.25, 2.5, by = 0.25)
PHI <- pgamma(x, shape = 5.5, 2)
CDF <- c()
n= 10000
set.seed(12481632)
y = rgamma(n, shape = 5.5, rate = 2)
You could rewrite your expression for θ, factoring out an exponential density:
θ = ∫₀^∞ (x^4.5 / 2) · (2 e^(-2x)) dx
Here 2 e^(-2x) is the exponential density with rate = 2, which suggests how to integrate it using Monte Carlo:
Sample random values from the exponential distribution
Compute the function x^4.5 / 2 at the sampled values
The mean of those computed values is the Monte Carlo estimate of the integral
Code, R 4.0.3 x64, Win 10
set.seed(312345)
n <- 10000
x <- rexp(n, rate = 2.0)
f <- 0.5*x**4.5
mean(f)
prints
[1] 1.160716
You could even estimate statistical error as
sd(f)/sqrt(n)
which prints
[1] 0.1275271
Thus the Monte Carlo estimate of your integral θ is 1.160716 ± 0.1275271.
What is implemented here follows, e.g., http://www.math.chalmers.se/Stat/Grundutb/CTH/tms150/1112/MC.pdf, section 6.1.2, where g(x) is our power function x^4.5 / 2 and f(x) is our exponential density.
UPDATE
Just to clarify one thing: there is no single canonical way to split the integrand into a sampling PDF f(x) and a computable function g(x) whose mean value is our integral.
For example, I could write
θ = ∫₀^∞ (x^4.5 e^(-x)) · (e^(-x)) dx
Here e^(-x) is the PDF f(x), a simple exponential with rate = 1, and g(x) now carries the leftover exponential factor. Similar code
set.seed(312345)
n <- 10000
f <- rexp(n, rate = 1.0)
g <- f**4.5*exp(-f)
print(mean(g))
print(sd(g)/sqrt(n))
produces an integral value of 1.148697 ± 0.02158325. This is a somewhat better approach because the statistical error is smaller.
You could even write it as
θ = Γ(5.5) · 0.5^5.5 · ∫₀^∞ 1 · G(x | shape = 5.5, scale = 0.5) dx
where Γ(x) is the gamma function and G(x | shape, scale) is the Gamma density. So you could sample from the Gamma distribution with g(x) = 1 for every sampled x, which gives the exact answer with zero statistical error. Code
set.seed(312345)
f <- rgamma(n, 5.5, scale=0.5)
g <- f**0.0 # g is equal to 1 for any value of f
print(mean(g)*gamma(5.5)*0.5**5.5)
print(sd(g)/sqrt(n))
produces an integral value of 1.156623 ± 0.
The best way to estimate theta, given its definition, is:
theta <- integrate(function(x) x^4.5 * exp(-2*x), lower = 0, upper = Inf)$value
Giving:
theta
#> [1] 1.156623
Another way to handle this is to note that the constant lambda^rate / gamma(rate) can be taken outside the CDF integral; since the CDF at infinity is 1, theta must equal gamma(rate)/lambda^rate (here "rate" is the name used for the Gamma shape parameter, to match the functions below):
gamma(5.5)/2^5.5
#> [1] 1.156623
Note that we can also write functions for your pdf and cdf and plot them:
pdf <- function(t, rate, lambda) {
  (lambda^rate)/gamma(rate) * t^(rate-1) * exp(-lambda * t)
}
cdf <- function(x, rate, lambda) {
  sapply(x, function(y) {
    integrate(pdf, lower = 0, upper = y, lambda = lambda, rate = rate)$value
  })
}
curve(pdf(x, 5.5, 2), from = 0, to = 10)
curve(cdf(x, 5.5, 2), from = 0, to = 10)
It's not quite clear how you would want a Monte Carlo simulation to help you with any of this.
Generally, for the inverse sampling method, we have a density and we would like to sample from it. The first step is to find the cumulative distribution function for the density, then to find its inverse, and finally to evaluate that inverse at values randomly sampled from the uniform distribution.
For example, I have the density y = (3/2)/(1+x)^2, so the cdf is F(x) = 3x/(2(x+1)) (on [0, 2], where the density integrates to 1), and the inverse of the cdf is x = 2u/(3 - 2u).
To do this in R, I wrote
f <- function(x){
  y <- ((3/2)/(1+x)^2)
  return(y)
}
cdf <- function(x){
  integrate(f, -Inf, x)$value
}
invcdf <- function(q){
  uniroot(function(x){cdf(x) - q}, range(x))$root
}
U <- runif(1e6)
X <- invcdf(U)
I have two problems. First, the code returns the function and not the samples.
Second, is there another, simpler way to do this, for example to find the cdf and its inverse in a more direct way?
I would like to add that I am not looking for efficiency of the code; I am just interested in code that could be written by a beginner.
You could try a numerical approach to inverse sampling. As per your request, this is more about transparency of method than efficiency.
This function will numerically integrate a given function over the given range (though it will trim infinite values)
cdf <- function(f, lower_bound, upper_bound)
{
  if(lower_bound < -10000) lower_bound <- -10000  # Trim large negatives
  if(upper_bound > 10000) upper_bound <- 10000    # Trim large positives
  x <- seq(lower_bound, upper_bound, length.out = 100001)  # Finely divide x axis
  delta <- mean(diff(x))               # Get delta x (i.e. dx)
  mid_x <- (x[-1] + x[-length(x)])/2   # Get the mid point of each slice
  result <- cumsum(delta * f(mid_x))   # sum f(x) dx
  result <- result / max(result)       # normalize
  list(x = mid_x, cdf = result)        # return both x and the cdf in a list
}
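A quick sanity check of this numeric CDF (a sketch using dnorm): the last value should be exactly 1, since the result is normalized, and the value nearest x = 0 should be about 0.5.
chk <- cdf(dnorm, -10, 10)
tail(chk$cdf, 1)                # 1 by construction
chk$cdf[which.min(abs(chk$x))]  # roughly 0.5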
And to get the inverse, we draw a random number from the uniform distribution between 0 and 1, find the closest value in the cdf, and then see which value of x corresponds to that value of the cdf. We want to be able to do this for n samples at a time, so we use sapply:
inv_sample <- function(f, n = 1, lower_bound = -1000, upper_bound = 1000)
{
  CDF <- cdf(f, lower_bound, upper_bound)
  samples <- runif(n)
  sapply(samples, function(s) CDF$x[which.min(abs(s - CDF$cdf))])
}
We can test it by drawing histograms of the results. We'll start with the normal distribution's density function (dnorm in R), drawing 1000 samples and plotting their distribution:
hist(inv_sample(dnorm, 1000))
And we can do the same for the exponential distribution, this time setting the limits of integration between 0 and 100:
hist(inv_sample(dexp, 1000, 0, 100))
And finally we can do the same with your own example:
f <- function(x) 3/2/(1 + x)^2
hist(inv_sample(f, 1000, 0, 10))
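For this particular density the inversion can also be done analytically, which may be the simpler route the question asks about. A minimal sketch, assuming the support is [0, 2] so that f integrates to 1 and F(x) = 3x/(2(1+x)):
u <- runif(1e5)
x <- 2*u / (3 - 2*u)             # closed-form inverse of F(x) = 3x/(2*(1+x))
hist(x, breaks = 100, freq = FALSE)
curve(3/2/(1+x)^2, from = 0, to = 2, add = TRUE)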
I would like to pull 1000 samples from a custom distribution in R
I have the following custom distribution
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code which produced a histogram shape I did not expect.
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase with the index up to a threshold around index 400, level off, and then decrease. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values stored in points is different: a histogram of points shows how often its values fall within a certain range. You'll notice the x axis of the histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with about 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the upper tail, which you may not be expecting if you want the histogram to look like the plot of the values vs. the index (or some increasing x). It means there are more values in points around 2 than around 1. Look at how the curve of plot(points ~ x) flattens where the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: do you see how most of the values (whether they're at the low end or the high end of that curve) are close to 0, or at least less than 0.25? If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20)
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for(i in 2:N) {
    pr <- out[i - 1] + rnorm(1, ...)
    r <- fun(pr) / fun(out[i - 1])
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}
It starts from the point init and gives N draws. The approach can be improved in many ways, but here I simply start from init = 5, discard a burn-in period of 20,000 draws, and keep every second remaining draw to reduce the number of repeated values:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
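As a rough visual check (a sketch; the normalizing constant is obtained numerically with integrate()), you can overlay the normalized target density on the density of the draws:
Z <- integrate(fun, 0, Inf)$value             # normalizing constant of fun
curve(fun(x) / Z, from = 0, to = 20, add = TRUE, col = "red")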
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with( environment(ecd.points), approxfun(y,x) )
samp.inv.ecd <- function(n=100) invecdfpts( runif(n) )
plot(density (samp.inv.ecd(100) ) )
plot(density(points) )
png(); layout(matrix(1:2,1)); plot(density (samp.inv.ecd(100) ),main="The Sample" )
plot(density(points) , main="The Original"); dev.off()
Here's another way to do it that draws on "R: Generate data from a probability density distribution" and "How to create a distribution function in R?":
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)
f <- function (x) (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1))
C <- integrate(f,-Inf,Inf)
> C$value
[1] 11.50361
# normalize by C$value
f <- function (x)
(20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)/11.50361)
random.points <- approx(cumsum(f(x))/sum(f(x)), x, runif(10000))$y
hist(random.points,1000)
hist(random.points*40, 1000) will give scaling similar to your original function.
I want to estimate the scale, shape and threshold parameters of a 3p Weibull distribution.
What I've done so far is the following:
Refering to this post, Fitting a 3 parameter Weibull distribution in R
I've used the functions
EPS = sqrt(.Machine$double.eps) # "epsilon" for very small numbers
llik.weibull <- function(shape, scale, thres, x)
{
  sum(dweibull(x - thres, shape, scale, log = TRUE))
}
thetahat.weibull <- function(x)
{
  if(any(x <= 0)) stop("x values must be positive")
  toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
  mu = mean(log(x))
  sigma2 = var(log(x))
  shape.guess = 1.2 / sqrt(sigma2)
  scale.guess = exp(mu + (0.572 / shape.guess))
  thres.guess = 1
  res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower = EPS)
  c(shape = res$par[1], scale = res$par[2], thres = res$par[3])
}
to "pre-estimate" my Weibull parameters, such that I can use them as initial values for the argument "start" in the "fitdistr" function of the MASS-Package.
You might ask why I want to estimate the parameters twice... reason is that I need the variance-covariance-matrix of the estimates which is also estimated by the fitdistr function.
EXAMPLE:
set.seed(1)
thres <- 450
dat <- rweibull(1000, 2.78, 750) + thres
pre_mle <- thetahat.weibull(dat)
my_wb <- function(x, shape, scale, thres) {
  dweibull(x - thres, shape, scale)
}
ml <- fitdistr(dat, densfun = my_wb,
               start = list(shape = round(pre_mle[1], digits = 0),
                            scale = round(pre_mle[2], digits = 0),
                            thres = round(pre_mle[3], digits = 0)))
ml
> ml
      shape          scale          thres
   2.942548     779.997177     419.996196
 ( 0.152129)  ( 32.194294)   ( 28.729323)
> ml$vcov
shape scale thres
shape 0.02314322 4.335239 -3.836873
scale 4.33523868 1036.472551 -889.497580
thres -3.83687258 -889.497580 825.374029
This works quite well for cases where the shape parameter is above 1. Unfortunately, my approach needs to handle cases where the shape parameter could be smaller than 1.
The reason why this is not possible for shape parameters that are smaller than 1 is described here: http://www.weibull.com/hotwire/issue148/hottopics148.htm
In "Case 1: All three parameters are unknown", the following is said:
"Define the smallest failure time of ti to be tmin. Then when γ → tmin, ln(tmin - γ) → -∞. If β is less than 1, then (β - 1)ln(tmin - γ) goes to +∞ . For a given solution of β, η and γ, we can always find another set of solutions (for example, by making γ closer to tmin) that will give a larger likelihood value. Therefore, there is no MLE solution for β, η and γ."
This makes a lot of sense. For this very reason I want to do it the way they described it on this page.
"In Weibull++, a gradient-based algorithm is used to find the MLE solution for β, η and γ. The upper bound of the range for γ is arbitrarily set to be 0.99 of tmin. Depending on the data set, either a local optimal or 0.99tmin is returned as the MLE solution for γ."
I want to set a feasible interval for gamma (in my code called 'thres') such that the solution is between (0, .99 * tmin).
Does anyone have an idea how to solve this problem?
In the fitdistr function there seems to be no way to do a constrained MLE, constraining just one parameter.
Another option could be estimating the asymptotic variance via the outer product of the score vectors. The score vector could be taken from the function thetahat.weibull(x) used above, but calculating the outer product manually (without a function) seems very time consuming and does not solve the problem of the constrained ML estimation.
Best regards,
Tim
It's not too hard to set up a constrained MLE. I'm going to do this in bbmle::mle2; you could also do it in stats4::mle, but bbmle has some additional features.
The larger issue is that it's theoretically difficult to define the sampling variance of an estimate when it's on the boundary of the allowed space; the theory behind Wald variance estimates breaks down. You can still calculate confidence intervals by likelihood profiling ... or you could bootstrap. I ran into a variety of optimization issues when doing this ... I haven't really thought about whether there are specific reasons for them.
Reformat three-parameter Weibull function for mle2 use (takes x as first argument, takes log as an argument):
dweib3 <- function(x, shape, scale, thres, log = TRUE) {
  dweibull(x - thres, shape, scale, log = log)
}
Starting function (slightly reformatted):
weib3_start <- function(x) {
  mu <- mean(log(x))
  sigma2 <- var(log(x))
  logshape <- log(1.2 / sqrt(sigma2))
  logscale <- mu + (0.572 / logshape)
  logthres <- log(0.5*min(x))
  list(logshape = logshape, logsc = logscale, logthres = logthres)
}
Generate data:
set.seed(1)
dat <- data.frame(x=rweibull(1000, 2.78, 750) + 450)
Fit model: I'm fitting the parameters on the log scale for convenience and stability, but you could use boundaries at zero as well.
tmin <- log(0.99*min(dat$x))
library(bbmle)
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
           data = dat,
           upper = c(logshape = Inf, logsc = Inf,
                     logthres = tmin),
           start = weib3_start(dat$x),
           method = "L-BFGS-B")
vcov(m1), which would normally provide a variance-covariance estimate (unless the estimate is on the boundary, which is not the case here), gives NaN values ... not sure why without more digging.
library(emdbook)
tmpf <- function(x, y) m1@minuslogl(logshape = x,
                                    logsc = coef(m1)["logsc"],
                                    logthres = y)
tmpf(1.1,6)
s1 <- curve3d(tmpf,
xlim=c(1,1.2),ylim=c(5.9,tmin),sys3d="image")
with(s1,contour(x,y,z,add=TRUE))
h <- lme4:::hessian(function(x) do.call(m1@minuslogl, as.list(x)), coef(m1))
vv <- solve(h)
diag(vv) ## [1] 0.002672240 0.001703674 0.004674833
(se <- sqrt(diag(vv))) ## standard errors
## [1] 0.05169371 0.04127558 0.06837275
cov2cor(vv)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8852090 -0.8778424
## [2,] 0.8852090 1.0000000 -0.9616941
## [3,] -0.8778424 -0.9616941 1.0000000
This is the variance-covariance matrix of the log-scaled variables. If you want to convert to the variance-covariance matrix on the original scale, you need to scale element (i, j) by exp(x_i)*exp(x_j) (i.e. by the derivatives of the transformation exp(x)):
outer(exp(coef(m1)),exp(coef(m1))) * vv
## logshape logsc logthres
## logshape 0.02312803 4.332993 -3.834145
## logsc 4.33299307 1035.966372 -888.980794
## logthres -3.83414498 -888.980794 824.831463
I don't know why this doesn't work with numDeriv - I would be very careful with the variance estimates above. (Maybe too close to the boundary for Richardson extrapolation to work?)
library(numDeriv)
hessian(function(x) do.call(m1@minuslogl, as.list(x)), coef(m1))
grad(function(x) do.call(m1@minuslogl, as.list(x)), coef(m1)) ## looks OK
vcov(m1)
The profiles look OK ... (we have to supply std.err because the Hessian isn't invertible)
pp <- profile(m1,std.err=c(0.01,0.01,0.01))
par(las=1,bty="l",mfcol=c(1,3))
plot(pp,show.points=TRUE)
confint(pp)
## 2.5 % 97.5 %
## logshape 0.9899645 1.193571
## logsc 6.5933070 6.755399
## logthres 5.8508827 6.134346
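The bootstrap mentioned at the start is another route to confidence intervals. A rough nonparametric-bootstrap sketch (reusing dweib3, weib3_start, and dat from above; 100 refits, so it can be slow):
set.seed(101)
boot_pars <- replicate(100, {
  xb  <- sample(dat$x, replace = TRUE)
  fit <- try(mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
                  data = data.frame(x = xb),
                  upper = c(logshape = Inf, logsc = Inf,
                            logthres = log(0.99 * min(xb))),
                  start = weib3_start(xb), method = "L-BFGS-B"),
             silent = TRUE)
  if (inherits(fit, "try-error")) rep(NA, 3) else exp(coef(fit))  # back-transform
})
apply(boot_pars, 1, quantile, c(0.025, 0.975), na.rm = TRUE)      # percentile CIs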
Alternately, we can do this on the original scale ... one possibility would be to use the log-scaling to fit, then refit starting from those parameters on the original scale.
wstart <- as.list(exp(unlist(weib3_start(dat$x))))
names(wstart) <- gsub("log","",names(wstart))
m2 <- mle2(x ~ dweib3(shape, sc, thres),
           data = dat,
           lower = c(shape = 0.001, sc = 0.001, thres = 0.001),
           upper = c(shape = Inf, sc = Inf,
                     thres = exp(tmin)),
           start = wstart,
           method = "L-BFGS-B")
vcov(m2)
## shape sc thres
## shape 0.02312399 4.332057 -3.833264
## sc 4.33205658 1035.743511 -888.770787
## thres -3.83326390 -888.770787 824.633714
all.equal(unname(coef(m2)),unname(exp(coef(m1))),tol=1e-4)
About the same as the values above.
We can fit with a small shape if we are a little more careful to bound the parameters, but now we end up on the boundary for the threshold, which will cause lots of problems for the variance calculations.
set.seed(1)
dat <- data.frame(x = rweibull(1000, .53, 365) + 100)
tmin <- log(0.99 * min(dat$x))
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
           lower = c(logshape = -10, logscale = 0, logthres = 0),
           upper = c(logshape = 20, logsc = 20, logthres = tmin),
           data = dat,
           start = weib3_start(dat$x), method = "L-BFGS-B")
For censored data, you need to replace dweibull with pweibull; see Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf for some hints.
Another possible solution is to do Bayesian inference. Using scale priors on the shape and scale parameters and a uniform prior on the location parameter, you can easily run Metropolis-Hastings as follows. It may be advisable to reparameterize in terms of log(shape), log(scale) and log(y_min - location), because the posterior for some of the parameters becomes strongly skewed, in particular for the location parameter. Note that the output below shows the posterior for the back-transformed parameters.
library(MCMCpack)
logposterior <- function(par, y) {
  gamma <- min(y) - exp(par[3])
  sum(dweibull(y - gamma, exp(par[1]), exp(par[2]), log = TRUE)) + par[3]
}
y <- rweibull(100,shape=.8,scale=10) + 1
chain0 <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=.01*diag(3))
chain <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=var(chain0))
plot(exp(chain))
summary(exp(chain))
This produces the following output
#########################################################
The Metropolis acceptance rate was 0.43717
#########################################################
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] 0.81530 0.06767 0.0004785 0.001668
[2,] 10.59015 1.39636 0.0098738 0.034495
[3,] 0.04236 0.05642 0.0003990 0.001174
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
var1 0.6886083 0.768054 0.81236 0.8608 0.9498
var2 8.0756210 9.637392 10.50210 11.4631 13.5353
var3 0.0003397 0.007525 0.02221 0.0548 0.1939
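A quick convergence check can also help here (a sketch; MCMCmetrop1R returns a coda mcmc object, so coda's diagnostics apply directly):
library(coda)
effectiveSize(chain)   # effective sample size per (log-scale) parameter
autocorr.plot(chain)   # autocorrelation of the chains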
While trying to port some code from Matlab to R I have run into a problem. The gist of the code is to produce a 2D kernel density estimate and then do some simple calculations using the estimate. In Matlab the KDE calculation was done using the function ksdensity2d.m. In R the KDE calculation is done with kde2d from the MASS package. So let's say I want to calculate the KDE and just add up the values (this is not what I intend to do, but it serves this purpose). In R this can be done by
library(MASS)
set.seed(1009)
x <- sample(seq(1000, 2000), 100, replace=TRUE)
y <- sample(seq(-12, 12), 100, replace=TRUE)
kk <- kde2d(x, y, h=c(30, 1.5), n=100, lims=c(1000, 2000, -12, 12))
sum(kk$z)
which gives the answer 0.3932732. When using ksdensity2d in Matlab using the same exact data and conditions the answer is 0.3768. From looking at the code for kde2d I noticed that the bandwidth is divided by 4
kde2d <- function (x, y, h, n = 25, lims = c(range(x), range(y)))
{
    nx <- length(x)
    if (length(y) != nx)
        stop("data vectors must be the same length")
    if (any(!is.finite(x)) || any(!is.finite(y)))
        stop("missing or infinite values in the data are not allowed")
    if (any(!is.finite(lims)))
        stop("only finite values are allowed in 'lims'")
    n <- rep(n, length.out = 2L)
    gx <- seq.int(lims[1L], lims[2L], length.out = n[1L])
    gy <- seq.int(lims[3L], lims[4L], length.out = n[2L])
    h <- if (missing(h))
        c(bandwidth.nrd(x), bandwidth.nrd(y))
    else rep(h, length.out = 2L)
    if (any(h <= 0))
        stop("bandwidths must be strictly positive")
    h <- h/4
    ax <- outer(gx, x, "-")/h[1L]
    ay <- outer(gy, y, "-")/h[2L]
    z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay),
        , nx))/(nx * h[1L] * h[2L])
    list(x = gx, y = gy, z = z)
}
A simple check to see if the difference in bandwidth is the reason for the difference in the results is then
kk <- kde2d(x, y, h=c(30, 1.5)*4, n=100, lims=c(1000, 2000, -12, 12))
sum(kk$z)
which gives 0.3768013 (which is the same as the Matlab answer).
So my question is then: Why does kde2d divide the bandwidth by four? (Or why doesn't ksdensity2d?)
At the mirrored github source, lines 31-35:
    if (any(h <= 0))
        stop("bandwidths must be strictly positive")
    h <- h/4 # for S's bandwidth scale
    ax <- outer(gx, x, "-" )/h[1L]
    ay <- outer(gy, y, "-" )/h[2L]
and the help file for kde2d(), which suggests looking at the help file for bandwidth. That says:
...which are all scaled to the width argument of density and so give
answers four times as large.
But why?
density() says that the width argument exists for the sake of compatibility with S (the precursor to R). The comments in the source for density() read:
## S has width equal to the length of the support of the kernel
## except for the gaussian where it is 4 * sd.
## R has bw a multiple of the sd.
The default is the Gaussian one. When the bw argument is unspecified and width is given, width is substituted in, e.g.:
library(MASS)
set.seed(1)
x <- rnorm(1000, 10, 2)
all.equal(density(x, bw = 1), density(x, width = 4)) # Only the call is different
However, because kde2d() was apparently written to remain compatible with S (and I suppose it was originally written FOR S, given it's in MASS), everything ends up divided by four. After flipping to the relevant section of MASS, the book (around p. 126), it seems they may have picked four to strike a balance between smoothness and fidelity of the data.
In conclusion, my guess is that kde2d() divides by four to remain consistent with the rest of MASS (and other things originally written for S), and that the way you're going about things looks fine.
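If you want to convince yourself of the scaling numerically, here is a small check (a sketch, reusing the x and y simulated in the question): with kde2d()'s internal division by 4, passing h = 4*c(hx, hy) should reproduce a direct Gaussian product-kernel KDE with standard deviations hx and hy on the same grid.
hx <- 30; hy <- 1.5
kk4 <- kde2d(x, y, h = 4*c(hx, hy), n = 100, lims = c(1000, 2000, -12, 12))
zmanual <- outer(kk4$x, kk4$y, Vectorize(function(a, b)
  mean(dnorm(a, x, hx) * dnorm(b, y, hy))))   # average of product kernels
all.equal(kk4$z, zmanual)                     # TRUE up to numerical tolerance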