Grid specification in smooth c.d.f. estimation ("kerdiest" package)

I wanted to get a smooth estimate of a cumulative distribution function. One way to do this is to integrate a kernel density estimator, which gives a kernel distribution estimator. To get one, I used the kde function from the "kerdiest" package.
The problem is that I have to specify a grid, which affects the results greatly. The default choice of grid leads to a graph that differs significantly from the plot of the empirical distribution function (see the picture; white dots represent the empirical c.d.f.). I can pick grid values so that the kernel estimator and the ecdf coincide, but I do not understand how it works.
So, what is the grid and how should it be chosen? Is there any other way to get a kernel estimator of a distribution function?
The data I have been experimenting with are the waiting times from the Old Faithful geyser dataset in R.
The code is:
x <- faithful$waiting
library("kerdiest")
n <- length(x)
kcdf <- kde(type_kernel = "n", x, bw = 1/sqrt(n))  # "n" = normal (Gaussian) kernel
plot(kcdf$Estimated_values)
lines(ecdf(x))

Instead of plotting with the default plot method for a single vector, you should use both the Estimated_values and the grid values to form the initial plot. Then the lines call will have the correct x-values. (The clue here is the labelling of your plot: when you see the "Index" label, you might wonder whether it is the correct scale. When plot gets a single numeric vector it uses the positions in that vector as the "Index", so you see the integers 1:length(vector).)
with( kcdf, plot(Estimated_values ~ grid) ) # using plot.formula
lines(ecdf(x))
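If you only need a Gaussian-kernel estimate of the c.d.f., another route (a minimal sketch, not using kerdiest) is to average normal c.d.f.s centred at the data points, since for a normal kernel the kernel distribution estimator is F(t) = mean(pnorm((t - x)/h)):
x <- faithful$waiting
n <- length(x)
h <- 1/sqrt(n)                               # same bandwidth as above
grid <- seq(min(x) - 3*h, max(x) + 3*h, length.out = 512)
# kernel distribution estimator: average of normal c.d.f.s centred at the data
kcdf_direct <- sapply(grid, function(t) mean(pnorm((t - x)/h)))
plot(grid, kcdf_direct, type = "l")
lines(ecdf(x))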

Related

Test multivariate normality of 2D binned data in R

I have some heatmap data and I want a notion of whether that heatmap is 'centered' around the middle of my image or skewed to one side (in R). My data is too big to give an example here, so here is some fake data of the same form (in real life my intensity values are not uniformly distributed; I assume they are binned counts from an underlying multivariate normal distribution, but I don't know how to code that as a reproducible example).
library(tibble)
set.seed(42)
dd <- tibble(
  x = rep(0:7, each = 8),    # horizontal pixel index
  y = rep(0:7, 8),           # vertical pixel index
  intensity = sample(0:10, 64, replace = TRUE)
)
The x value here is the horizontal index of a pixel, the y value is the vertical index of a pixel, and intensity is the value of that pixel according to a heatmap. I have managed to find a "centre" of the heatmap by marginalising these intensity values and taking the marginal mean for x and y, but how would I perform a hypothesis test of whether the underlying multivariate normal distribution is centered around a certain point? In this case I would like a test statistic (more specifically a -log10 p-value) for whether the underlying multivariate normal distribution that generated this count data is centered around the point c(3.5, 3.5).
Furthermore, I would also like a test statistic (again, a -log10 p-value) for whether the underlying distribution that generated the count data actually is multivariate normal.
This is all part of a larger pipeline where I would like to use dplyr and group_by to perform this test on multiple heatmaps at once so if it is possible to keep this in tidy format that would be great.
A little bit of googling finds this web page, which suggests mvnormtest::mshapiro.test:
# reshape the intensity vector into a (roughly) square matrix and run
# the multivariate Shapiro-Wilk test on it
mshap <- function(z, nrow = round(sqrt(length(z)))) {
  mvnormtest::mshapiro.test(matrix(z, nrow = nrow))
}
mshap(dd$intensity)
If you want to make this more tidy-like, you could do something with map/nest/etc.; for example:
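A hedged sketch of the tidy route, assuming a hypothetical long data frame all_heatmaps with a heatmap_id column alongside x, y and intensity (those names are mine, not from the question):
library(dplyr)
results <- all_heatmaps %>%               # all_heatmaps / heatmap_id are assumed names
  group_by(heatmap_id) %>%
  summarise(p.value = mshap(intensity)$p.value, .groups = "drop") %>%
  mutate(neg.log10.p = -log10(p.value))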
I'm not quite sure how to test the centering hypothesis (a likelihood ratio test using mnormt::dmnorm?).
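One rough alternative (a sketch, not a full likelihood-ratio test): expand the binned counts into (x, y) pseudo-observations and apply Hotelling's T^2 for H0: mean = c(3.5, 3.5). Treating each unit of intensity as an independent observation is a simplifying assumption.
# one (x, y) pseudo-observation per unit of intensity
obs <- as.matrix(dd[rep(seq_len(nrow(dd)), dd$intensity), c("x", "y")])
hotelling_T2 <- function(X, mu0) {
  n <- nrow(X); p <- ncol(X)
  xbar <- colMeans(X)
  S <- cov(X)
  T2 <- n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0)   # Hotelling's T^2
  Fstat <- (n - p) / (p * (n - 1)) * T2                   # F transformation
  p.value <- pf(Fstat, p, n - p, lower.tail = FALSE)
  c(T2 = T2, p.value = p.value, neg.log10.p = -log10(p.value))
}
hotelling_T2(obs, mu0 = c(3.5, 3.5))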

Probability Density (pdf) extraction in R

I am attempting to reproduce the above function in R. The numerator has the product of the probability density function (pdf) of "y" at time "t". The omega_t is simply the weight (which for now let's ignore). The i stands for each forecast of y (along with its density) produced by model_i at time t.
The denominator is the integral of the above product. My question is: how do I estimate the densities? To estimate the density of a variable one needs some data points. So far I have this:
y <- c(-0.00604, -0.00180, 0.00292, -0.0148)
forecasty_model1 <- c(-0.0183, 0.00685)  # forecasts for times t=1 and t=2
forecasty_model2 <- c(-0.0163, 0.00931)  # similarly
all.y.1 <- c(y, forecasty_model1)  # together in one vector
all.y.2 <- c(y, forecasty_model2)  # same
However, I am not sure how to extract the density of x1 for time t=1, or t=6, in order to do the products. I have considered finding the estimated density using this:
dy1<-density(all.y.1)
which(dy1$x==0.00685)
integer(0) #length(dy1$x) : 512
with dy1$x containing the n coordinates of the points where the density is estimated, according to the documentation. Shouldn't n be 6, or at least shouldn't dy1$x contain the points of y that I supplied? What is the correct way to extract the density (pdf) of y?
There is an n argument in density, which defaults to 512. density returns estimated density values on a relatively dense, evenly spaced grid so that you can plot the density curve. The grid points are determined by the range of your data (plus some extension) and the value of n; your sample points will generally not lie exactly on this grid.
You can use linear interpolation to get a density value anywhere covered by this grid; see:
Find the probability density of a new data point using "density" function in R
Exact kernel density value for any point in R
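As a concrete example (a minimal sketch along the lines of those answers), approx() can interpolate the estimated density at the forecast values themselves, reusing the vectors from the question:
y <- c(-0.00604, -0.00180, 0.00292, -0.0148)
forecasty_model1 <- c(-0.0183, 0.00685)
all.y.1 <- c(y, forecasty_model1)
dy1 <- density(all.y.1)
# linearly interpolate the estimated density at the original points
dens_at_points <- approx(dy1$x, dy1$y, xout = all.y.1)$y
dens_at_points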

The optimal grid size for 2D kernel density distribution in R

I am generating 2D kernel density distributions for every pair of numeric columns in a data set, using the kde2d function in the MASS package.
This takes the following parameters:
kde2d(x, y, h, n=25, lims = c(range(x), range(y)))
where n is the "Number of grid points in each direction. Can be scalar or a length-2 integer vector".
I want to optimize the dimensions of the grid for every pair of columns. At the moment I use fixed dimensions of 10x10. Does anyone know a formula for optimizing the grid size so I can generate optimal density estimates for each pair of columns?
Thanks
The parameter n in this function does not influence your density estimation but only the graphical representation, i.e. it should only depend on the size of the plot you want to create but not on the data.
On the other hand, your density estimate is indeed influenced by the choice of bandwidth h. To choose an optimal bandwidth you will need to know (or assume) the distribution of your data.
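For instance, a small sketch with the faithful data (an assumed example dataset): bandwidth.nrd() is the normal-reference bandwidth rule that kde2d uses by default, and changing n only changes the resolution of the returned grid, not the underlying estimate:
library(MASS)
x <- faithful$eruptions
y <- faithful$waiting
h <- c(bandwidth.nrd(x), bandwidth.nrd(y))   # normal-reference bandwidths
dens_coarse <- kde2d(x, y, h = h, n = 10)    # 10 x 10 grid, as in the question
dens_fine   <- kde2d(x, y, h = h, n = 200)   # finer grid, same density estimate
contour(dens_fine)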

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focussed forum. However, as I'm planning to implement this with R, I figured it would be best to post it here. Apologies if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning a continuous (non-integer) value in line with a kernel density estimate derived from empirical data (discrete), with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course)). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
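An alternative sketch (not part of the answer above): sample from the KDE directly by resampling the data and adding Gaussian kernel noise with the fitted bandwidth, so the draws are not restricted to grid points at all:
set.seed(4396)
dat = round(rnorm(1000,100,10))
bw = density(dat)$bw                             # bandwidth chosen by density()
kde.samp = sample(dat, 250000, replace=TRUE) +   # resample the data ...
  rnorm(250000, mean=0, sd=bw)                   # ... and add Gaussian kernel noise
plot(density(dat), lwd=2)
lines(density(kde.samp), col="blue", lwd=2)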

Fitting Model Parameters To Histogram Data in R

So I've got a data set that I want to parameterise, but it is not a Gaussian distribution so I can't parameterise it in terms of its mean and standard deviation. I want to fit a distribution function with a set of parameters and extract the values of the parameters (e.g. a and b) that give the best fit. I want to do this in exactly the same way as
lm(y~f(x;a,b))
except that I don't have a y, I have a distribution of different x values.
Here's an example. If I assume that the data follow a Gumbel (double exponential) distribution with density
f(x; u, b) = (1/b) * exp(-(z + exp(-z))), where z = (x - u)/b:
library(QRM)
library(ggplot2)
rg <- rGumbel(1000)  # default parameters are 0 and 1 for u and b
# then plot its distribution
qplot(rg)
# should give a nicely skewed distribution
If I assume that I don't know the distribution parameters and I want to perform a best fit of the probability density function to the observed frequency data, how do I go about showing that the best fit is (in this test case) u = 0 and b = 1?
I don't want code that simply maps the function onto the plot graphically, although that would be a nice aside. I want a method that I can repeatedly use to extract variables from the function to compare to others. GGPlot / qplot was used as it quickly shows the distribution for anyone wanting to test the code. I prefer to use it but I can use other packages if they are easier.
Note: This seems to me like a really obvious thing to have been asked before but I can't find one that relates to histogram data (which again seems strange) so if there's another tutorial I'd really like to see it.
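One route (a hedged sketch reusing rg from the snippet above, not a definitive answer): minimise the Gumbel negative log-likelihood with optim() and read off the fitted parameters; MASS::fitdistr with a user-supplied density should give the same estimates together with standard errors.
# log f(x; u, b) = -log(b) - z - exp(-z), with z = (x - u)/b
negloglik <- function(par, x) {
  u <- par[1]
  b <- par[2]
  if (b <= 0) return(Inf)      # keep the scale parameter positive
  z <- (x - u) / b
  -sum(-log(b) - z - exp(-z))  # negative log-likelihood
}
fit <- optim(c(u = median(rg), b = sd(rg)), negloglik, x = rg)
fit$par   # should come out close to u = 0, b = 1 for this test case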
