Implementing a different Kernel for 2D Kernel Density Estimation in R

I'm looking for help implementing a 2-dimensional kernel density estimate with isotropic variance and a bivariate normal kernel, except that, because the data lie on the surface of the earth, I need to use great-circle distance rather than the usual Euclidean distance.
I'd like to do this in R, but I can't figure out how to use a distance metric other than simple Euclidean distance with any of the built-in estimators, since they rely on a convolution-based method to add up the kernels. Does anyone have a way to program an arbitrary kernel?

I ended up modifying the kde2d function from the MASS package. Significant revision was needed, as shown below. That said, the code is very flexible and allows an arbitrary 2-d kernel to be used. (rdist.earth() is used for the great-circle distance, h is the chosen bandwidth, here in km, and n is the number of grid points in each direction. rdist.earth() requires the fields package.)
The function could be modified to perform calculations in more than two dimensions, but the grid gets large very fast in higher dimensions: with the default n = 200 there are already 200^2 = 40,000 grid points in 2-d, and there would be 200^3 = 8,000,000 in 3-d. (Not that it's small now.)
Comments and suggestions on elegance or performance are welcome!
kde2d_mod <- function(data, h, n = 200, lims = c(range(data$lat), range(data$lon))) {
  # data is a matrix: lon, lat for each source (lon, lat to match the rdist.earth format).
  print(Sys.time()) # for timing
  nx <- dim(data)[1]
  if (dim(data)[2] != 2)
    stop("data vectors have only lat-long data")
  if (any(!is.finite(data)))
    stop("missing or infinite values in the data are not allowed")
  if (any(!is.finite(lims)))
    stop("only finite values are allowed in 'lims'")
  # Grid:
  g <- grid(n, lims) # function to create the grid (defined below)
  # The distance matrix gets large... Can we work around it? YES WE CAN!
  sets <- ceiling(dim(g)[1] / 10000)
  # Allocate the output:
  z <- rep(as.double(0), dim(g)[1])
  for (i in (1:sets) - 1) {
    g_subset <- g[(i * 10000 + 1):(min((i + 1) * 10000, dim(g)[1])), ]
    a_matrix <- rdist.earth(g_subset, data, miles = FALSE)
    z[(i * 10000 + 1):(min((i + 1) * 10000, dim(g)[1]))] <- apply( # Here is my kernel...
      a_matrix, 1, FUN = function(X)
        sum(exp(-X^2 / (2 * (h^2)))) / (2 * pi * nx)
    )
    rm(a_matrix)
  }
  print(Sys.time())
  # Un-transpose the final data.
  z <- t(matrix(z, n, n))
  dim(z) <- c(n^2, 1)
  z <- as.vector(z)
  return(z)
}
The key point here is that any kernel can be used in that inner loop; the downside is that the estimate is evaluated at grid points, so a high-resolution grid is needed. An FFT-based approach would be great, but I didn't attempt it.
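For instance, here is roughly what that inner apply() call could look like with an Epanechnikov kernel swapped in. This is only a sketch: the planar normalization constant 2/(pi*h^2) is only approximate when distances are great-circle distances on the sphere.

idx <- (i * 10000 + 1):(min((i + 1) * 10000, dim(g)[1]))
z[idx] <- apply(
  a_matrix, 1, FUN = function(X)
    sum((2 / (pi * h^2)) * (1 - (X / h)^2) * (X <= h)) / nx  # Epanechnikov, truncated at distance h
)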
Grid Function:
grid <- function(n, lims) {
  num <- rep(n, length.out = 2L)
  gx <- seq.int(lims[1L], lims[2L], length.out = num[1L])
  gy <- seq.int(lims[3L], lims[4L], length.out = num[2L])
  v1 <- rep(gy, length(gx))
  v2 <- rep(gx, length(gy))
  v1 <- matrix(v1, nrow = length(gy), ncol = length(gx))
  v2 <- t(matrix(v2, nrow = length(gx), ncol = length(gy)))
  grid_out <- c(unlist(v1), unlist(v2))
  grid_out <- aperm(array(grid_out, dim = c(n, n, 2)), c(3, 2, 1)) # reshape
  grid_out <- unlist(as.list(grid_out))
  dim(grid_out) <- c(2, n^2)
  grid_out <- t(grid_out)
  return(grid_out)
}
You can plot the values using image.plot, with the v1 and v2 matrices for your x,y points:
kde2d_mod_plot <- function(kde2d_mod_output, n, lims) {
  num <- rep(n, length.out = 2L)
  gx <- seq.int(lims[1L], lims[2L], length.out = num[1L])
  gy <- seq.int(lims[3L], lims[4L], length.out = num[2L])
  v1 <- rep(gy, length(gx))
  v2 <- rep(gx, length(gy))
  v1 <- matrix(v1, nrow = length(gy), ncol = length(gx))
  v2 <- t(matrix(v2, nrow = length(gx), ncol = length(gy)))
  image.plot(v1, v2, matrix(kde2d_mod_output, n, n)) # image.plot() is from the fields package
  map('world', fill = FALSE, add = TRUE)             # map() is from the maps package
}

Related

Calculation of allowed space within monte carlo simulated data of 3 variables (cube in 3D coordinates)

I'm working on calculating the robust working range of a process. For this purpose I'm building models from DOE data and simulating data with a Monte Carlo approach. Filtering the data with a criterion on the response leads to an allowed space (see the plots for better visualization).
In the example below there are 3 variables, and the goal is to calculate the biggest possible cube (axis-parallel) within the allowed space. This would describe the working range of the process. The coding step just puts every variable on the same range (-1 to 1).
library(tidyverse)
library(MASS)
library(ggplot2)
library(gridExtra)
library(rgl)

df <- data.frame(
  X1 = runif(100, 0, 2),
  X2 = runif(100, 10, 30),
  X3 = runif(100, 5, 75)) %>%
  mutate(Y1 = 2 * X1 - 2 * X2 + X3)

f1 <- Y1 ~ X1 + X2 + X3
model1 <- lm(f1, data = df)

m.c <- NULL
n <- 10000
for (k in 1:n) {
  X1 <- runif(1, 0, 2)
  X2 <- runif(1, 10, 30)
  X3 <- runif(1, 5, 75)
  m.c <- rbind(m.c, data.frame(X1, X2, X3))
}

m.c_coded <- m.c %>%
  mutate(predict1 = predict(model1, newdata = .)) %>%
  mutate(X1 = (X1 - 1) / 1) %>%
  mutate(X2 = (X2 - 20) / 10) %>%
  mutate(X3 = (X3 - 40) / 35)

Space <- m.c_coded %>%
  filter(predict1 <= 0)

p1 <- ggplot(Space) +
  geom_point(aes(X1, X2)) +
  xlim(-1, 1) +
  ylim(-1, 1)
p2 <- ggplot(Space) +
  geom_point(aes(X1, X3)) +
  xlim(-1, 1) +
  ylim(-1, 1)
p3 <- ggplot(Space) +
  geom_point(aes(X2, X3)) +
  xlim(-1, 1) +
  ylim(-1, 1)
grid.arrange(arrangeGrob(p1, p2, p3, nrow = 1), nrow = 1)

MODR_plot3D <- plot3d(x = Space$X1, y = Space$X2, z = Space$X3, type = "p",
                      xlim = c(-1, 1), ylim = c(-1, 1), zlim = c(-1, 1))
There are specialized programs (DOE software) that can calculate this so-called design space, but I want to implement it in my R script. Sadly I do not have any idea how to calculate the position (edges) of this cube. My approach would be to find the point of the allowed space with the maximum distance to its surface and use it as the centre of the cube.
Does anyone have an idea how I can calculate this cube in a proper way? If possible, I would like to extend this to the n-dimensional case.
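A rough brute-force sketch along the lines of the approach described above (an assumption, not a definitive solution): treat every allowed Monte Carlo point as a candidate centre, measure its Chebyshev (L-infinity) distance to the nearest disallowed point and to the boundary of [-1,1]^3, and keep the candidate with the largest such distance. Column names match the coded data above.

allowed   <- as.matrix(Space[, c("X1", "X2", "X3")])
outside   <- as.matrix(m.c_coded[m.c_coded$predict1 > 0, c("X1", "X2", "X3")])
half_side <- apply(allowed, 1, function(p) {
  d_out  <- min(apply(abs(sweep(outside, 2, p)), 1, max))  # Chebyshev distance to nearest disallowed point
  d_edge <- min(1 - abs(p))                                # Chebyshev distance to the [-1,1]^3 boundary
  min(d_out, d_edge)
})
centre <- allowed[which.max(half_side), ]
centre                 # centre of the approximate largest axis-parallel cube
2 * max(half_side)     # its side length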

R: Inverse fft() to confirm my manual DFT algorithm inaccurate?

Using R, before assessing some metric of accuracy on my own manual implementation of DFT, I wanted to do a sanity check on how well stats::fft() performs by doing the following:
sig.ts = ts( sin(2*pi*freq1*t) + sin(2*pi*freq2*t) );
sig.rt = fft(fft(sig.ts)/N, inverse = TRUE);
# the two plots so perfectly align that you can't see them both
max(abs(sig.ts - sig.rt)) / max(sig.ts);
# arbitrary crude accuracy metric = 1.230e-15 - EXCELLENT!
But I wanted to write the code for DFT myself, to ensure I understand it, then invert it in the hopes that it would be the same:
## The following is the slow DFT for now, not the FFT...
sR = 102.4;               # the number of Hz at which we sample
freq1 = 3; freq2 = 12;    # frequency(ies) of the wave
t = seq(1/sR, 10, 1/sR);
sig.ts = ts( sin(2*pi*freq1*t) + sin(2*pi*freq2*t) );
N = length(t); kk = seq(0, N/2-1, 1); nn = seq(0, N-1, 1);
for(k in kk){
  sig.freqd[k] = 0;
  for(n in nn){
    sig.freqd[k] = sig.freqd[k] + sig.ts[n+1]*exp(-j*2*pi*n*k/N); } }
sig.freqd = (1/N)*sig.freqd;   # for normalization
# Checking the "accuracy" of my manual implementation of DFT...
sig.freqd_inv = Re(fft(sig.freqd, inverse = TRUE));
plot(t[1:100], window(sig.ts, end = 100), col = "black", type = "l", lty = 1, lwd = 1, xaxt = "n");
lines(t[1:100], window(sig.freqd_inv, end = 100), col = "red", type = "l", lty = 1, lwd = 1, xaxt = "n");
axis(1, at = seq(round(t[1],1), round(t[length(t)],1), by = 0.1), las = 2);
max(abs(sig.ts[1:(N/2-1)] - sig.freqd_inv)) / max(sig.ts[1:(N/2-1)]);   # the metric here = 1.482, unfortunately
Even without the metric, the plot makes it obvious that something's off here: the result has lower amplitude, may be out of phase, and is more jagged. From all of my self-studying, I will say that I am a bit confused about how sensitive this all is to vector length, as well as how to ensure that the imaginary component's phase information is taken into account when plotting.
Bottom line, any insight into what's wrong with my DFT algorithm would be helpful. I don't want to just blackbox my use of functions - I want to understand these things more deeply before moving on to more complicated functions.
Thanks,
Christian
The main issues arise from the signal indexing. First, to get a full transform usable by R's fft(..., inverse = TRUE), you need to compute all N coefficients (even though the coefficients above N/2 - 1 could be obtained by symmetry).
Then you should realize that array indexing in R is 1-based. So while indexing sig.freqd[k], the index k should start at 1 instead of 0. Since the argument to exp(-1i*2*pi*n*k/N) should start with n = 0 and k = 0, you'll need to adjust the indices:
kk = seq(1, N, 1); nn = seq(1, N, 1);
for(k in kk){
  sig.freqd[k] = 0i;
  for(n in nn){
    sig.freqd[k] = sig.freqd[k] + sig.ts[n]*exp(-1i*2*pi*(n-1)*(k-1)/N);
  }
}
I've also changed your usage of j to the imaginary unit 1i, since that's the notation recognized by R (and R was complaining about it when trying your posted sample as-is). If you had defined j = 1i, that wouldn't affect the results.
Note also that R's fft is unnormalized. So to obtain the same result for the forward transform, your DFT implementation should not include the 1/N normalization. On the other hand, you will need to add this factor as a final step in order to get the full-circle forward+backward transform to match the original signal.
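A quick way to see this convention in action (a small sketch): a forward fft() followed by an inverse fft() divided by N reproduces the input up to rounding error.

x <- rnorm(8)
max(abs(x - Re(fft(fft(x), inverse = TRUE) / 8)))   # on the order of 1e-16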
With these changes you should have the following code:
## The following is the slow DFT for now, not the FFT...
sR = 102.4;               # the number of Hz at which we sample
freq1 = 3; freq2 = 12;    # frequency(ies) of the wave
t = seq(1/sR, 10, 1/sR);
sig.ts = ts( sin(2*pi*freq1*t) + sin(2*pi*freq2*t) );
N = length(t); kk = seq(1, N, 1); nn = seq(1, N, 1);
sig.freqd = complex(N);   # pre-allocate so that sig.freqd[k] can be assigned
for(k in kk){
  sig.freqd[k] = 0i;
  for(n in nn){
    sig.freqd[k] = sig.freqd[k] + sig.ts[n]*exp(-1i*2*pi*(n-1)*(k-1)/N);
  }
}
# Checking the "accuracy" of my manual implementation of DFT...
sig.freqd_inv = (1/N)*Re(fft(sig.freqd, inverse = TRUE));
plot(t[1:100], window(sig.ts, end = 100), col = "black", type = "l", lty = 1, lwd = 2, xaxt = "n");
lines(t[1:100], window(sig.freqd_inv, end = 100), col = "red", type = "l", lty = 2, lwd = 1, xaxt = "n");
axis(1, at = seq(round(t[1],1), round(t[length(t)],1), by = 0.1), las = 2);
max(abs(sig.ts - sig.freqd_inv)) / max(sig.ts)
This should yield a metric around 1.814886e-13, which is probably more in line with what you were expecting. The corresponding plot should also show the original signal and the round-trip signal overlapping:

How do I find level sets for a function on R^d, in R?

I am looking for an efficient way to find level sets of an arbitrary function from [0,1]^d to R.
To be clear: with a level set I mean the set of points in [0,1]^d that are mapped to the same value.
In all of my applications, the level sets are connected. They are lines, planes, or some higher-dimensional hyperplane, but apart from the connectedness, they do not satisfy any general criterion.
I am looking for a subset of the level set that has a high density everywhere.
When I limit my functions to 2d, I can use the function contourLines from the package grDevices, which does exactly what I am looking for:
test <- function(x, y){
  y - (x^2 - 6*x + 9)
}
Mat <- matrix(0, 100, 100)
x <- seq(-10, 10, length.out = 100)
y <- seq(-10, 10, length.out = 100)
for(i in 1:100){
  for(j in 1:100){
    Mat[i, j] <- test(x[i], y[j])
  }
}
cont <- contourLines(x, y, Mat, levels = 0)
Unfortunately I have not been able to find a function that does the same trick in higher dimensions.
To give a bit more context to the problem:
I have a 'wild' function of which I hardly know anything, but which I can easily evaluate at any point in R^d. This function divides R^d (or [0,1]^d, to keep it simple) into a positive part (level sets larger than 0) and a negative part (level sets smaller than 0). I am looking for the boundary separating the two, which is the level set for 0.
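For d = 3 specifically, one option (a sketch only, using the same marching-cubes machinery that appears in the misc3d question below) is computeContour3d(), which returns a dense set of triangle vertices lying on the zero level set; the function f here is a toy stand-in for the 'wild' function:

library(misc3d)
f <- function(x, y, z) x^2 + y^2 + z^2 - 0.5            # toy example, assumed for illustration
n <- 50
x <- y <- z <- seq(0, 1, length.out = n)
g <- expand.grid(x = x, y = y, z = z)
vox <- array(with(g, f(x, y, z)), c(n, n, n))
surf <- computeContour3d(vox, level = 0, x = x, y = y, z = z)  # points on the boundary between f > 0 and f < 0
head(surf)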

Drawing a smooth implicit surface with misc3d

The misc3d package provides a great implementation of the marching cubes algorithm, which makes it possible to plot implicit surfaces.
For example, let's plot a Dupin cyclide:
a = 0.94; mu = 0.56; c = 0.34 # cyclide parameters
f <- function(x, y, z, a, c, mu){ # implicit equation f(x,y,z) = 0
  b <- sqrt(a^2 - c^2)
  (x^2 + y^2 + z^2 - mu^2 + b^2)^2 - 4*(a*x - c*mu)^2 - 4*b^2*y^2
}
# define the "voxel"
nx <- 50; ny <- 50; nz <- 25
x <- seq(-c-mu-a, abs(mu-c)+a, length=nx)
y <- seq(-mu-a, mu+a, length=ny)
z <- seq(-mu-c, mu+c, length=nz)
g <- expand.grid(x=x, y=y, z=z)
voxel <- array(with(g, f(x,y,z,a,c,mu)), c(nx,ny,nz))
# plot the surface
library(misc3d)
surf <- computeContour3d(voxel, level=0, x=x, y=y, z=z)
drawScene.rgl(makeTriangles(surf))
Nice, except that the surface is not smooth.
The documentation of drawScene.rgl says: "Object-specific rendering features such as smoothing and material are controlled by setting in the objects." I don't know what that means. How can I get a smooth surface?
I have a solution, but not a straightforward one: it consists of building a mesh3d object from the output of computeContour3d and including the surface normals in this mesh3d.
The surface normals of an implicit surface defined by f(x,y,z)=0 are simply given by the gradient of f. It is not hard to derive the gradient for this example.
gradient <- function(xyz, a, c, mu){
  x <- xyz[1]; y <- xyz[2]; z <- xyz[3]
  b <- sqrt(a^2 - c^2)
  c(
    2*(2*x)*(x^2 + y^2 + z^2 - mu^2 + b^2) - 8*a*(a*x - c*mu),
    2*(2*y)*(x^2 + y^2 + z^2 - mu^2 + b^2) - 8*b^2*y,
    2*(2*z)*(x^2 + y^2 + z^2 - mu^2 + b^2)
  )
}
Then the normals are computed as follows:
normals <- apply(surf, 1, function(xyz){
  gradient(xyz, a, c, mu)
})
Now we are ready to make the mesh3d object:
mesh <- list(vb = rbind(t(surf), 1),
             it = matrix(1:nrow(surf), nrow = 3),
             primitivetype = "triangle",
             normals = rbind(-normals, 1))
class(mesh) <- c("mesh3d", "shape3d")
And finally to plot it with rgl:
library(rgl)
shade3d(mesh, color="red")
Nice, the surface is smooth now.
But is there a more straightforward way to get a smooth surface, without building a mesh3d object? What do they mean in the documentation: "Object-specific rendering features such as smoothing and material are controlled by setting in the objects."?
I don't know what the documentation is suggesting. However, you can do it via a mesh object slightly more easily than you did (though the results aren't quite as nice), using the addNormals() function to calculate the normals automatically rather than by formula.
Here are the steps:
Compute the surface as you did.
Create the mesh without normals. This is basically what you did, but using tmesh3d():
mesh <- tmesh3d(t(surf), matrix(1:nrow(surf), nrow=3), homogeneous = FALSE)
Calculate which vertices are duplicates of which others:
verts <- apply(mesh$vb, 2, function(column) paste(column, collapse = " "))
firstcopy <- match(verts, verts)
Rewrite the indices to use the first copy. This is necessary, since the misc3d functions give a collection of disconnected triangles; we need to work out which are connected.
it <- as.numeric(mesh$it)
it <- firstcopy[it]
dim(it) <- dim(mesh$it)
mesh$it <- it
At this point, there are a lot of unused vertices in the mesh; if memory was a problem you might want to add a step to remove them. I'm going to skip that.
Add the normals
mesh <- addNormals(mesh)
Here are the before and after shots. Left is without normals, right is with them.
It's not quite as smooth as your solution using computed normals, but it's not always easy to find those.
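For reference, here are those steps collected into one runnable sequence; it is the same code as above, just in one place, and it reuses voxel, x, y and z from the question:

library(misc3d)
library(rgl)
surf <- computeContour3d(voxel, level = 0, x = x, y = y, z = z)
mesh <- tmesh3d(t(surf), matrix(1:nrow(surf), nrow = 3), homogeneous = FALSE)
verts <- apply(mesh$vb, 2, function(column) paste(column, collapse = " "))
firstcopy <- match(verts, verts)
mesh$it <- matrix(firstcopy[as.numeric(mesh$it)], nrow = 3)   # reconnect duplicated vertices
mesh <- addNormals(mesh)                                      # rgl::addNormals
shade3d(mesh, color = "red")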
There's an option smooth in the makeTriangles function:
drawScene.rgl(makeTriangles(surf, smooth=TRUE))
I think the result is equivalent to @user2554330's solution, but this is more straightforward.
EDIT
The result is much better with the rmarchingcubes package:
library(rmarchingcubes)
contour_shape <- contour3d(
griddata = voxel, level = 0,
x = x, y = y, z = z
)
library(rgl)
tmesh <- tmesh3d(
vertices = t(contour_shape[["vertices"]]),
indices = t(contour_shape[["triangles"]]),
normals = contour_shape[["normals"]],
homogeneous = FALSE
)
open3d(windowRect = c(50, 50, 562, 562))
view3d(zoom=0.8)
shade3d(tmesh, color = "darkred")

spatial filtering by proximity in R

I have occurrence points for a species, and I'd like to remove potential sampling bias (where some regions might have a much greater density of points than others). One way to do this would be to maximize a subset of points such that no two are less than a certain distance X from each other. Essentially, I would prevent points from being too close to each other.
Are there any existing R functions to do this? I've searched through various spatial packages, but haven't found anything, and can't figure out exactly how to implement this myself.
An example occurrence point dataset can be downloaded here.
Thanks!
I've written a new version of this function that no longer really follows rMaternII.
The input can either be a SpatialPoints, SpatialPointsDataFrame or matrix object.
Seems to work well, but suggestions welcome!
filterByProximity <- function(xy, dist, mapUnits = F) {
  # xy can be either a SpatialPoints or SPDF object, or a matrix
  # dist is in km if mapUnits = F, in map units otherwise
  # requires the sp package for spDists()
  if (!mapUnits) {
    d <- spDists(xy, longlat = T)
  }
  if (mapUnits) {
    d <- spDists(xy, longlat = F)
  }
  diag(d) <- NA
  close <- (d <= dist)
  diag(close) <- NA
  closePts <- which(close, arr.ind = T)
  discard <- matrix(nrow = 2, ncol = 2)
  if (nrow(closePts) > 0) {
    while (nrow(closePts) > 0) {
      if ((!paste(closePts[1,1], closePts[1,2], sep = '_') %in% paste(discard[,1], discard[,2], sep = '_')) &
          (!paste(closePts[1,2], closePts[1,1], sep = '_') %in% paste(discard[,1], discard[,2], sep = '_'))) {
        discard <- rbind(discard, closePts[1,])
        closePts <- closePts[-union(which(closePts[,1] == closePts[1,1]), which(closePts[,2] == closePts[1,1])),]
      }
    }
    discard <- discard[complete.cases(discard),]
    return(xy[-discard[,1],])
  }
  if (nrow(closePts) == 0) {
    return(xy)
  }
}
Let's test it:
require(rgeos)
require(sp)
pts <- readWKT("MULTIPOINT ((3.5 2), (1 1), (2 2), (4.5 3), (4.5 4.5), (5 5), (1 5))")
pts2 <- filterByProximity(pts,dist=2, mapUnits=T)
plot(pts)
axis(1)
axis(2)
apply(as.data.frame(pts),1,function(x) plot(gBuffer(SpatialPoints(coords=matrix(c(x[1],x[2]),nrow=1)),width=2),add=T))
plot(pts2,add=T,col='blue',pch=20,cex=2)
There is also an R package called spThin that performs spatial thinning on point data. It was developed to reduce the effects of sampling bias for species distribution models, and it runs multiple iterations for optimization. The function is quite easy to use; the vignette can be found here. There is also a paper in Ecography with details about the technique.
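A minimal call might look like the following sketch. The data frame, its column names, and the 10 km thinning distance are made up for illustration, and the argument names are those from the spThin vignette, so check them against your installed version:

library(spThin)
# occs is assumed to be a data frame with columns SPECIES, LONGITUDE, LATITUDE (hypothetical names)
thinned <- thin(loc.data = occs,
                lat.col = "LATITUDE", long.col = "LONGITUDE", spec.col = "SPECIES",
                thin.par = 10,     # minimum nearest-neighbour distance, in km
                reps = 100,        # number of thinning iterations
                locs.thinned.list.return = TRUE,
                write.files = FALSE, write.log.file = FALSE)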
Following Josh O'Brien's advice, I looked at spatstat's rMaternI function, and came up with the following. It seems to work pretty well.
The distance is in map units. It would be nice to incorporate one of R's distance functions that always returns distances in meters, rather than input units, but I couldn't figure that out...
require(spatstat)
require(maptools)
occ <- readShapeSpatial('occurrence_example.shp')
filterByProximity <- function(occ, dist) {
  pts <- as.ppp.SpatialPoints(occ)
  d <- nndist(pts)
  z <- which(d > dist)
  return(occ[z,])
}
occ2 <- filterByProximity(occ, dist = 0.2)
plot(occ)
plot(occ2, add = T, col = 'blue', pch = 20)
Rather than removing data points, you might consider spatial declustering. This involves giving points in clusters a lower weight than outlying points. The two simplest ways to do this involve a polygonal segmentation, like a Voronoi diagram, or some arbitrary grid. Both methods will weight points in each region according to the area of the region.
For example, if we take the points in your test (1,1),(2,2),(4.5,4.5),(5,5),(1,5) and apply a regular 2-by-2 mesh, where each cell is three units on a side, then the five points fall into three cells. The points ((1,1),(2,2)) falling into the cell [0,3]X[0,3] would each have weights 1/( no. of points in current cell TIMES tot. no. of occupied cells ) = 1 / ( 2 * 3 ). The same thing goes for the points ((4.5,4.5),(5,5)) in the cell (3,6]X(3,6]. The "outlier", (1,5) would have a weight 1 / ( 1 * 3 ). The nice thing about this technique is that it is a quick way to generate a density based weighting scheme.
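Here is a small sketch of that grid-based weighting for the example points above (cell size of 3 units, as in the text):

pts  <- data.frame(x = c(1, 2, 4.5, 5, 1), y = c(1, 2, 4.5, 5, 5))
cell <- paste(ceiling(pts$x / 3), ceiling(pts$y / 3))          # which 3-unit cell each point falls in
w    <- 1 / (as.numeric(table(cell)[cell]) * length(unique(cell)))
w    # 1/6 1/6 1/6 1/6 1/3, matching the worked example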
A polygonal segmentation involves drawing a polygon around each point and using the area of that polygon to calculate the weight. Generally, the polygons completely cover the entire region, and each point's weight is proportional to the area of its polygon (clustered points get small polygons and therefore small weights). A Voronoi diagram is usually used for this, but polygonal segmentations may be calculated using other techniques, or may be specified by hand.
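And a corresponding sketch of Voronoi-based weights with spatstat; the 0-6 observation window is an assumption chosen just to enclose the example points:

library(spatstat)
X <- ppp(x = c(1, 2, 4.5, 5, 1), y = c(1, 2, 4.5, 5, 5), window = owin(c(0, 6), c(0, 6)))
areas <- tile.areas(dirichlet(X))   # area of each point's Voronoi cell, in point order
w <- areas / sum(areas)             # clustered points get small cells and hence small weights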
