Create a Joint Kernel Density Function in R

I have two vectors, A and B. Can I create a joint kernel density function empirically in R, such that I have some function f(x,y) that comes from vectors A and B? The following example wouldn't work, as it isn't a joint probability distribution:
z <- cbind(A,B)
approxfun(density(z))

MASS::kde2d does two-dimensional kernel density estimation given two vectors (of x and y coordinates). Rather than return a function that can be evaluated at an arbitrary ({newx,newy}), though, it returns the function evaluated on a square grid.
Once you've done the fussy bits like selecting a bandwidth, the actual computations for kernel density estimation at a single point x0,y0 aren't that hard. I think it would be something like
sum(dnorm((x0-x)/h)*dnorm((y0-y)/h))/(length(x)*h^2)
MASS::kde2d does clever stuff with outer() and tcrossprod() to compute the distances from all the data points to all of the points on the evaluation grids, and all of the sums, in a small number of top-level operations, but I think what I have above is the crux of it.
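The single-point formula above can be wrapped in a closure to get a function f(x0, y0) that is callable at arbitrary points. Below is a minimal sketch in Python (standard library only, for illustration). The name make_joint_kde, the separate bandwidths hx and hy, and the sample data are assumptions, not part of MASS; bandwidth selection is left to the caller.

```python
import math

def make_joint_kde(xs, ys, hx, hy):
    """Return f(x0, y0), a product-Gaussian kernel density estimate."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)

    def gauss(u):
        # Standard normal density, the analogue of R's dnorm(u).
        return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

    def f(x0, y0):
        total = sum(gauss((x0 - x) / hx) * gauss((y0 - y) / hy)
                    for x, y in zip(xs, ys))
        # Dividing by n * hx * hy makes the estimate integrate to 1.
        return total / (n * hx * hy)

    return f

# Hypothetical sample data standing in for vectors A and B.
A = [0.1, 0.4, 0.35, 0.8]
B = [1.2, 0.9, 1.1, 0.7]
f = make_joint_kde(A, B, hx=0.2, hy=0.2)
print(f(0.4, 1.0))
```

Each call costs one pass over the data, so this suits occasional evaluations; for a dense grid, the vectorised approach in kde2d is the better fit.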

Related

Is this dct (FFTW.jl) behavior in julia normal?

I'm trying to do some compressed sensing exercises in Julia, but I noticed that the discrete cosine transform (using FFTW.jl) of an identity matrix doesn't look like the result from other programming languages (namely Mathematica and Matlab).
For example in Julia
using Plots, FFTW, LinearAlgebra
n = 100
Psi = dct(Matrix(1.0I,n,n))
heatmap(Psi)
results in this matrix (which is essentially an identity matrix with some noise)
But in Matlab
imagesc(dct(eye(100,100),'Type',2))
this is the result (as expected)
Finally in Mathematica
MatrixPlot[N[FourierDCTMatrix[100, 2]], PlotLegends -> Automatic]
returns this
Why does Julia behave so differently?
And is this normal?
Matlab (and, I guess, Mathematica) computes the DCT of each column of your matrix. FFTW performs a 2-dimensional DCT when the input is two-dimensional. The same happens for fft.
If you want column-wise transformation, you can specify the dimension:
Psi1 = dct(Matrix(1.0I,n,n), 1); # along first dimension
heatmap(Psi1)
Notice that the direction of the y-axis is opposite for Plots.jl relative to Matlab.
(BTW, you can also just write I(n) or 1.0I(n) instead of Matrix(1.0I,n,n))
This is something that sets Julia apart from some other languages. It tends to treat matrices as matrices, and not as just a collection of vectors or a bunch of scalars. For example, exp(M) and log(M) for matrices do not operate elementwise, but calculate the matrix exponential and matrix logarithm according to their linear algebra definitions.
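To see why the 2-dimensional transform of an identity matrix looks like an identity matrix again, here is a small sketch in plain Python (no FFTW). It assumes the orthonormal DCT-II convention. With C as the DCT matrix, the column-wise transform of I is C itself, while transforming along both dimensions gives C I Cᵀ = I up to rounding. The helper names are made up for this illustration.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C times v is the DCT of vector v."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                     for j in range(n)])
    return rows

def matmul(a, b):
    # Plain triple-loop matrix product on lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

n = 8
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
C = dct_matrix(n)

# DCT of each column (what Matlab's dct does): the result is C itself.
columnwise = matmul(C, identity)

# DCT along both dimensions: C * I * C^T, which is I again up to rounding.
two_dim = matmul(matmul(C, identity), transpose(C))
```

The tiny off-diagonal values in two_dim are floating-point rounding, which would match the "identity matrix with some noise" seen in the heatmap.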

Using FFT in R to Determine Density Function for IID Sum

The goal is to compute the density function of a sum of n IID random variables from the density function of one of these random variables by:
Transforming the density function into the characteristic function via fft
Raising the characteristic function to the power n
Transforming the resulting characteristic function into the density function of interest via fft(inverse=TRUE)
Below is my naive attempt at this:
sum_of_n <- function(density, n, xstart, xend, power_of_2)
{
  x <- seq(from=xstart, to=xend, by=(xend-xstart)/(2^power_of_2-1))
  y <- density(x)
  fft_y <- fft(y)
  fft_sum_of_y <- (fft_y ^ n)
  sum_of_y <- Re(fft(fft_sum_of_y, inverse=TRUE))
  return(sum_of_y)
}
In the above, density is an arbitrary density function: for example
density <- function(x){return(dgamma(x = x, shape = 2, rate = 1))}
n indicates the number of IID random variables being summed. xstart and xend are the start and end of the approximate support of the random variable. power_of_2 is the power of 2 length for the numeric vectors used. As I understand things, lengths of powers of two increase the efficiency of the fft algorithm.
I understand at least partially why the above does not work as intended in general. Firstly, the values themselves will not be scaled correctly, as fft(inverse=TRUE) does not normalize by default. However, I find that the values are still not correct when I divide by the length of the vector i.e.
sum_of_y <- sum_of_y / length(sum_of_y)
which based on my admittedly limited understanding of fft is the normalizing calculation. Secondly, the resulting vector will be out of phase due to (someone correct me on this if I am wrong) the shifting of the zero frequency that occurs when fft is performed. I have tried to use, for example, pracma's fftshift and ifftshift, but they do not appear to address this problem correctly. For symmetric distributions e.g. normal, this is not difficult to address since the phase shift is typically exactly half, so that an operation like
sum_of_y <- c(sum_of_y[(length(y)/2+1):length(y)], sum_of_y[1:(length(y)/2)])
works as a correction. However, for asymmetric distributions like the gamma distribution above this fails.
In conclusion, are there adjustments to the code above that will result in an appropriately scaled and appropriately shifted final density function for the IID sum?
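One possible set of adjustments, sketched in Python with a small hand-written radix-2 FFT so that it stays self-contained. It rests on three assumptions. First, the sampled density is multiplied by the grid step dx so the vector holds probability masses. Second, the vector is zero-padded so the circular convolution implied by the FFT cannot wrap around. Third, the output grid starts at n*xstart rather than xstart, which accounts for the apparent shift. The names density_of_sum and m (the number of grid points) are made up here.

```python
import cmath
import math

def fft(a, inverse=False):
    """Recursive radix-2 FFT; like R's fft, the inverse is left unnormalised."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def density_of_sum(density, n, xstart, xend, m):
    """Approximate the density of a sum of n IID variables on a regular grid."""
    dx = (xend - xstart) / (m - 1)
    # Probability mass per grid cell (a Riemann-sum approximation).
    mass = [density(xstart + k * dx) * dx for k in range(m)]
    length = n * (m - 1) + 1          # support size of the n-fold convolution
    size = 1
    while size < length:              # pad to a power of two, no wrap-around
        size *= 2
    padded = mass + [0.0] * (size - m)
    spectrum = [c ** n for c in fft(padded)]
    conv = fft(spectrum, inverse=True)
    xs = [n * xstart + j * dx for j in range(length)]
    # Divide by size to normalise the inverse FFT, by dx to return to a density.
    ys = [conv[j].real / size / dx for j in range(length)]
    return xs, ys

# Example: the sum of two Exponential(1) variables is Gamma(shape=2, rate=1).
xs, ys = density_of_sum(lambda x: math.exp(-x), 2, 0.0, 20.0, 512)
```

The accuracy is limited by the Riemann-sum discretisation, so the error shrinks roughly in proportion to dx.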

The optimal grid size for 2D kernel density distribution in R

I am generating 2D kernel density distributions for every pair of numeric columns in a data set, using the kde2d function in the MASS package.
This takes the following parameters:
kde2d(x, y, h, n=25, lims = c(range(x), range(y)))
where n is the "Number of grid points in each direction. Can be scalar or a length-2 integer vector".
I want to optimize the dimensions of the grid for every pair of columns. At the moment, I use fixed dimensions of 10x10. Does anyone know a formula for optimizing the grid size so I can generate optimal density estimates for each pair of columns?
Thanks
The parameter n in this function does not influence your density estimation but only the graphical representation, i.e. it should only depend on the size of the plot you want to create but not on the data.
On the other hand, your density estimate is indeed influenced by the choice of bandwidth h. To choose an optimal bandwidth you will need to know (or assume) the distribution of your data.
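As an illustration of assuming a distribution, here is a sketch of a normal-reference rule of thumb, h = 1.06 * sd * n^(-1/5), which is derived under the assumption that the data are roughly Gaussian. It is written in Python for illustration; the function name is made up, and this is not necessarily the default that kde2d applies.

```python
import statistics

def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth h = 1.06 * sd * n^(-1/5) for a Gaussian kernel."""
    n = len(values)
    sd = statistics.stdev(values)   # sample standard deviation
    return 1.06 * sd * n ** (-1.0 / 5.0)

# One bandwidth per column, computed independently for x and y.
h_x = rule_of_thumb_bandwidth([1.0, 2.0, 3.0, 4.0, 5.0])
```

For clearly non-Gaussian columns (multimodal or heavy-tailed), a rule like this tends to oversmooth, so it is best treated as a starting point.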

Mahalanobis distance in R, error: system is computationally singular

I'd like to calculate multivariate distance from a set of points to the centroid of those points. Mahalanobis distance seems to be suited for this. However, I get an error (see below).
Can anyone tell me why I am getting this error, and if there is a way to work around it?
If you download the coordinate data and the associated environmental data, you can run the following code.
require(maptools)
occ <- readShapeSpatial('occurrences.shp')
load('envDat.Rdata')
#standardize the data to scale the variables
dat <- as.matrix(scale(dat))
centroid <- dat[1547,] #let's assume this is the centroid in this case
#Calculate multivariate distance from all points to centroid
mahalanobis(dat,center=centroid,cov=cov(dat))
Error in solve.default(cov, ...) :
system is computationally singular: reciprocal condition number = 9.50116e-19
The Mahalanobis distance requires you to calculate the inverse of the covariance matrix. The function mahalanobis internally uses solve which is a numerical way to calculate the inverse. Unfortunately, if some of the numbers used in the inverse calculation are very small, it assumes that they are zero, leading to the assumption that it is a singular matrix. This is why it specifies that they are computationally singular, because the matrix might not be singular given a different tolerance.
The solution is to set the tolerance for when it assumes that they are zero. Fortunately, mahalanobis allows you to pass this parameter (tol) to solve:
mahalanobis(dat,center=centroid,cov=cov(dat),tol=1e-20)
# [1] 24.215494 28.394913 6.984101 28.004975 11.095357 14.401967 ...
mahalanobis uses the covariance matrix, cov (more precisely, its inverse), to transform the coordinate system, and then computes the Euclidean distance in the new coordinates. A standard reference is Duda & Hart, "Pattern Classification and Scene Analysis".
Looks like your cov matrix is singular. Perhaps there are linearly-dependent columns in "dat" that are unnecessary? Setting the tolerance to zero won't help if the covariance matrix is truly singular. The first thing to do, instead, is look for columns that might be a rescaling of some other column, or might be just a sum of 2 or more other columns, and remove them. Such columns are redundant for the mahalanobis distance.
BTW, since mahalanobis distance is effectively a rescaling and rotation, calling the scaling function looks superfluous - any reason why you want that?
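A quick way to check for such redundant columns is to compare the rank of the data matrix with its number of columns. The sketch below is in Python for illustration and uses Gaussian elimination with partial pivoting. The function name, the tolerance and the toy data are assumptions; in the toy data the third column is the sum of the first two.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]), default=None)
        if pivot is None or abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank

# Toy data: column 3 = column 1 + column 2, so only two columns carry information.
dat = [[1, 2, 3], [2, 1, 3], [0, 4, 4], [5, 1, 6], [3, 3, 6]]
print(matrix_rank(dat), "independent columns out of", len(dat[0]))
```

If the rank is smaller than the number of columns, the covariance matrix is singular no matter what tolerance is passed to solve.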

approximation methods

I attached image:
(source: piccy.info)
The image shows a plot of a function that is defined at given points, for example at x=1..N.
The other plot, drawn as a semi-transparent curve, is what I want to get from the original plot,
i.e. I want to approximate the original function so that it becomes smooth.
Are there any methods for doing that?
I have heard about the least squares method, which can be used to approximate a function by a straight line or by a parabola. But I do not need a parabolic approximation.
I probably need to approximate it by a trigonometric function.
So are there any methods for doing that?
One idea: is it possible to use the least squares method for this problem, if we can derive it for trigonometric functions?
One more question!
If I use the discrete Fourier transform and think of the function as a sum of waves, maybe the noise has special features by which we can identify it; then we could set the corresponding frequencies to zero and perform the inverse Fourier transform.
If you think that is possible, what would you suggest for identifying the noise frequencies?
Unfortunately, many of the solutions presented here don't solve the problem and/or are plain wrong.
There are many approaches, and each is built for specific conditions and requirements that you must be aware of!
a) Approximation theory: If you have a very sharply defined function without errors (given by either definition or data) and you want to trace it as exactly as possible, you use polynomial or rational approximation by Chebyshev or Legendre polynomials, meaning that you approximate the function by a polynomial or, if it is periodic, by a Fourier series.
b) Interpolation: If you have a function where some points (but not the whole curve!) are given and you need a function that passes through these points, you can use several methods:
Newton-Gregory, Newton with divided differences, Lagrange, Hermite, Spline
c) Curve fitting: You have a function with given points and you want to draw a curve with a given (!) function which approximates the curve as closely as possible. There are linear and nonlinear algorithms for this case.
Your drawing implies:
It is not remotely like a mathematical function.
It is not sharply defined by data or a function.
You need to fit the curve, not some points.
What you want and need is
d) Smoothing: Given a curve or data points with noise or rapidly changing elements, you only want to see the slow changes over time.
You can do that with LOESS as Jacob suggested (but I find that overkill, especially because choosing a reasonable span needs some experience). For your problem, I simply recommend the running average as suggested by Jim C.
http://en.wikipedia.org/wiki/Running_average
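A minimal sketch of such a running average, in Python for illustration. The function name and the choice to return only the points where the full window fits are assumptions; other edge treatments (padding, shrinking windows) are equally valid.

```python
def running_average(values, window):
    """Running mean over full windows; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]   # slide the window by one sample
        out.append(total / window)
    return out

print(running_average([1, 2, 3, 4, 5], 3))   # [2.0, 3.0, 4.0]
```

A wider window gives a smoother curve at the cost of flattening genuine peaks, so the window size plays the same role as the span in LOESS.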
Sorry, cdonner and Orendorff, your proposals are well-intentioned, but completely wrong because you are using the right tools for the wrong problem.
These guys used a sixth-degree polynomial to fit climate data and embarrassed themselves completely.
http://scienceblogs.com/deltoid/2009/01/the_australians_war_on_science_32.php
http://network.nationalpost.com/np/blogs/fullcomment/archive/2008/10/20/lorne-gunter-thirty-years-of-warmer-temperatures-go-poof.aspx
Use loess in R (free).
E.g. here the loess function approximates a noisy sine curve.
(source: stowers-institute.org)
As you can see you can tweak the smoothness of your curve with span
Here's some sample R code from here:
Step-by-Step Procedure
Let's take a sine curve, add some "noise" to it, and then see how the loess "span" parameter affects the look of the smoothed curve.
Create a sine curve and add some noise:
period <- 120
x <- 1:120
y <- sin(2*pi*x/period) + runif(length(x),-1,1)
Plot the points on this noisy sine curve:
plot(x,y, main="Sine Curve + 'Uniform' Noise")
mtext("showing loess smoothing (local regression smoothing)")
Apply loess smoothing using the default span value of 0.75:
y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))
Compute loess smoothed values for all points along the curve:
y.predict <- predict(y.loess, data.frame(x=x))
Plot the loess smoothed curve along with the points that were already plotted:
lines(x,y.predict)
You could use a digital filter like an FIR filter. The simplest FIR filter is just a running average. For a more sophisticated treatment, look at something like an FFT.
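The Fourier route can be sketched as a crude low-pass filter: transform, zero every bin above a cutoff, transform back. The Python sketch below uses a naive O(N^2) DFT to stay self-contained. The cutoff of 3 cycles per record and the synthetic test signal are assumptions; in practice the cutoff would be chosen by inspecting the magnitudes of the spectrum.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (the inverse is scaled by 1/N)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t, v in enumerate(values))
        out.append(s / n if inverse else s)
    return out

def low_pass(values, cutoff):
    """Zero every frequency bin above `cutoff` cycles per record, then invert."""
    n = len(values)
    spectrum = dft(values)
    # Bins k and n - k hold the same frequency, so keep or drop them together.
    kept = [c if min(k, n - k) <= cutoff else 0j
            for k, c in enumerate(spectrum)]
    return [c.real for c in dft(kept, inverse=True)]

n = 64
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]
# Synthetic "noise": a faster sine at 20 cycles per record.
noisy = [s + 0.5 * math.sin(2 * math.pi * 20 * t / n) for t, s in enumerate(slow)]
smooth = low_pass(noisy, cutoff=3)
```

With real data the noise is rarely confined to a single frequency, so a hard cutoff like this can introduce ringing; a running average or LOESS is often the gentler choice.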
This is called curve fitting. The best way to do this is to find a numeric library that can do it for you. Here is a page showing how to do this using scipy. The picture on that page shows what the code does:
(source: scipy.org)
Now it's only 4 lines of code, but the author doesn't explain it at all. I'll try to explain briefly here.
First you have to decide what form you want the answer to be. In this example the author wants a curve of the form
f(x) = p0 cos (2π/p1 x + p2) + p3 x
You might instead want the sum of several curves. That's OK; the formula is an input to the solver.
The goal of the example, then, is to find the constants p0 through p3 to complete the formula. scipy can find this array of four constants. All you need is an error function that scipy can use to see how close its guesses are to the actual sampled data points.
fitfunc = lambda p, x: p[0]*cos(2*pi/p[1]*x+p[2]) + p[3]*x # Target function
errfunc = lambda p: fitfunc(p, Tx) - tX # Distance to the target function
errfunc takes just one parameter: an array of length 4. It plugs those constants into the formula and calculates an array of values on the candidate curve, then subtracts the array of sampled data points tX. The result is an array of error values; presumably scipy will take the sum of the squares of these values.
Then just put some initial guesses in and scipy.optimize.leastsq crunches the numbers, trying to find a set of parameters p where the error is minimized.
p0 = [-15., 0.8, 0., -1.] # Initial guess for the parameters
p1, success = optimize.leastsq(errfunc, p0[:])
The result p1 is an array containing the four constants. success is 1, 2, 3, or 4 if the solver actually found a solution. (If the errfunc is sufficiently crazy, the solver can fail.)
This looks like a polynomial approximation. You can play with polynomials in Excel ("Add Trendline" to a chart, select Polynomial, then increase the order to the level of approximation that you need). It shouldn't be too hard to find an algorithm/code for that.
Excel can show the equation that it came up with for the approximation, too.
