I'm using the sf package in R to simulate a sample of agents moving between different nodes in a network across space-time.
I'm currently puzzled by some behavior of st_intersects. The agents move between nodes at the corners of the unit square as well as through the center at (.5,.5). However, when I test whether an agent at st_point(c(.1,.9)) intersects the geometry st_linestring(c(st_point(c(0,1)),st_point(c(0.5,0.5)))), I get an empty predicate.
In contrast if I detect an agent moving along the x-axis or y-axis only, I am able to detect the point correctly. Why is this?
Minimal reproducible example in R v4.0.2:
library(sf)
l1 <- st_linestring(c(st_point(c(0,1)),st_point(c(0.5,0.5))))
p1 <- st_point(c(.1,.9)) ## on the line between (0,1) and (.5,.5); y = 1 - x; x = f(t)
st_intersects(p1,l1) ## empty
#Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
# 1: (empty)
## in contrast
l2 <- st_linestring(c(st_point(c(0,0)),st_point(c(1,0))))
p2 <- st_point(c(.1,0)) ## on the line between (0,0) and (1,0) ; y = 0; x = f(t)
st_intersects(p2,l2) ## returns 1 as I would expect
#Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
# 1: 1
To elaborate a bit on Ege Rubak's answer: the point is off the line by a tiny distance, most likely because of floating-point arithmetic. Values such as 0.1 and 0.9 have no exact binary representation, so the stored coordinates are only approximations.
As a workaround, I suggest using sf::st_is_within_distance() with a dist value large enough to absorb rounding differences but small enough to avoid false positives. The right value may require some tuning, depending on the data used.
Consider this code, originally posted on the RStudio Community forum (where this question seems to have been cross posted): https://community.rstudio.com/t/simple-st-intersect-gives-unexpected-result/108214/3?u=jlacko
library(sf)
l1 <- st_linestring(c(st_point(c(0,1)),
st_point(c(0.5,0.5))))
p1 <- st_point(c(.1,.9)) ## on the line between (0,1) and (.5,.5); y = 1 - x; x = f(t)
st_distance(l1, p1)[1,1]
# [1] 1.962616e-17
st_is_within_distance(p1,l1, 1/1000)
# Sparse geometry binary predicate list of length 1, where the
# predicate was `is_within_distance'
# 1: 1
Floating-point geometry is inherently inexact, so your result is most likely due to the limited precision with which the computer represents your real numbers. A possible workaround is to find the distance between the point and the line, and to accept the point as being on the line if that distance is smaller than some threshold.
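To make the threshold idea concrete, here is a minimal base R sketch of the distance test, without sf. The helper names point_segment_distance() and on_segment() and the default tolerance of 1e-9 are my own assumptions for illustration, not part of any package, and the segment is assumed to have two distinct end points.

```r
# Hypothetical helper (not part of sf): shortest distance from point p
# to the segment from a to b, each given as c(x, y). Assumes a != b.
point_segment_distance <- function(p, a, b) {
  ab <- b - a
  # Relative position of the projection of p onto the line through a and b,
  # clamped to [0, 1] so the nearest point stays on the segment
  t <- sum((p - a) * ab) / sum(ab * ab)
  t <- max(0, min(1, t))
  nearest <- a + t * ab
  sqrt(sum((p - nearest)^2))
}

# Treat p as lying on the segment if it is within tol of it
on_segment <- function(p, a, b, tol = 1e-9) {
  point_segment_distance(p, a, b) <= tol
}

on_segment(c(0.1, 0.9), c(0, 1), c(0.5, 0.5))  # TRUE
on_segment(c(0.2, 0.9), c(0, 1), c(0.5, 0.5))  # FALSE
```

The tolerance plays the same role as the dist argument of st_is_within_distance() and probably needs the same kind of tuning to the scale of your coordinates.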
Let's just say I have the following scatterplot:
set.seed(665544)
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
plot(x)
I want to divide this scatterplot into square cells of a specified size and then count how many points fall into each unique cell. This will essentially give me the local density value of that cell. What is the best way of doing this? Is there an R package that can help? Perhaps a 2D histogram method like in Matlab?
Quick clarifications:
1.) I'd like the function/method to take the following 3 arguments: dimensions of total area, dimensions of cell (OR number of cells), and the data. It would then perhaps output a matrix where each value corresponds to a cell's point count.
2.) Q: Why do you want to use this method to determine local density? Isn't this much easier:
library(dbscan)
pointdensity(x, eps = .1, type = "frequency")
A: This method calculates the local density around each point. Though easy, this definition of local density then makes it very difficult (optimization algorithms necessary) to assign new data in a way that it matches the local density distribution of the original data set.
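One possible way to get the cell counts described above is to bin each coordinate with cut() and cross-tabulate the bins with table(), both from base R. This is only a sketch: count_in_cells() and its argument names are made up for illustration, and it assumes that the extent of the area is a multiple of the cell size (otherwise the leftover strip at the upper edge is ignored).

```r
# Hypothetical helper: count the points in xy (a two-column matrix) that
# fall into each square cell of side cell_size within xlim x ylim.
count_in_cells <- function(xy, xlim, ylim, cell_size) {
  x_breaks <- seq(xlim[1], xlim[2], by = cell_size)
  y_breaks <- seq(ylim[1], ylim[2], by = cell_size)
  # Points outside the area become NA and are dropped by table()
  x_bin <- cut(xy[, 1], breaks = x_breaks, include.lowest = TRUE)
  y_bin <- cut(xy[, 2], breaks = y_breaks, include.lowest = TRUE)
  table(x_bin, y_bin)
}

pts <- cbind(c(0.5, 0.6, 1.5, 2.5), c(0.5, 0.7, 1.5, 0.2))
counts <- count_in_cells(pts, xlim = c(0, 3), ylim = c(0, 3), cell_size = 1)
counts
```

The result is a 3-by-3 table whose rows are x cells and whose columns are y cells; as.matrix(counts) or unclass(counts) turns it into a plain matrix.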
I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into another, larger set of 2D coordinates. Imagine, for example, having a Polaroid photo of a small piece of the starry sky with you out at night, and wanting to hold it up in a position where it matches the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# take a subset of the coordinates to serve as the points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should be very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find a position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history), which used optim() to minimize the sum of distances between points, only works in the limited circumstance where the points being searched for lie in the middle of the reference point field. A solution that satisfies the question, and that still seems workable for a few thousand points, is a brute-force shift-and-compare algorithm. It aligns each reference point (ref_coords) in turn with one fixed sample point (the first row of my_coords_start), then counts how many sample points have another point within a minimum distance threshold after that shift. The threshold is needed to account for the noise in the data, and the shift with the most matches wins:
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for (i in 1:nrow(ref_coords)) {
  x2 <- my_coords_start[,1]
  y2 <- my_coords_start[,2]
  x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
  y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])
  ## Calculate all pairwise distances between reference and field data:
  dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")
  ## Only take distances for the sampled data:
  dists <- as.matrix(dists)[-1*1:length(x1),]
  ## Calculate the number of distances within the minimum
  ## distance threshold minus the diagonal portion:
  win_thresh <- sum(rowSums(dists < min_dist) > 1)
  ## If we have more "matches" than our best then calculate a new
  ## dx and dy:
  if (win_thresh > win_thresh_old) {
    win_thresh_old <- win_thresh
    dx <- (x2[1] - ref_coords[i,1])
    dy <- (y2[1] - ref_coords[i,2])
  }
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.
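One possible alternative, which is a different technique from the loop above, is a voting scheme. It computes the offset between every (reference, sample) pair, rounds the offsets onto a coarse grid, and takes the most frequent grid cell as the translation. The sketch below is only an illustration under assumptions: estimate_shift() and its tol argument are hypothetical names, the toy data carries no noise, and a true offset that sits near a bin edge could have its votes split between neighbouring bins. Note also that dx and dy here are the amounts to add to the sample points, which is the opposite sign from the loop above.

```r
# Hypothetical helper: estimate the translation (dx, dy) that moves
# sample_xy onto ref_xy by letting every (reference, sample) pair vote.
estimate_shift <- function(sample_xy, ref_xy, tol = 0.5) {
  sample_xy <- as.matrix(sample_xy)
  ref_xy <- as.matrix(ref_xy)
  # Offsets (reference minus sample) for all pairs of points
  dx <- as.vector(outer(ref_xy[, 1], sample_xy[, 1], "-"))
  dy <- as.vector(outer(ref_xy[, 2], sample_xy[, 2], "-"))
  # Bin the offsets at resolution tol and pick the fullest bin
  key <- paste(round(dx / tol), round(dy / tol))
  votes <- table(key)
  hit <- key == names(votes)[which.max(votes)]
  c(dx = mean(dx[hit]), dy = mean(dy[hit]), votes = sum(hit))
}

# Toy data: four reference points moved by (-10, -20)
set.seed(1)
ref <- cbind(x = runif(50, 0, 100), y = runif(50, 0, 100))
smp <- ref[c(3, 8, 21, 40), ]
smp[, 1] <- smp[, 1] - 10
smp[, 2] <- smp[, 2] - 20
shift <- estimate_shift(smp, ref)
shift
```

The work is still proportional to the number of reference points times the number of sample points, but it is done in a few vectorized calls rather than one dist() call per reference point.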
I'm trying to estimate the area of the 95% contour of a kde object from the ks package in R.
If I use the example data set from the ks package, I would create the kernel object as follows:
library(ks)
data(unicef)
H.scv <- Hscv(x=unicef)
fhat <- kde(x=unicef, H=H.scv)
I can easily plot the 25%, 50% and 75% contours using the plot function:
plot(fhat)
But I want to estimate the area within the contour.
I saw a similar question here, but the answer proposed does not solve the problem.
In my real application, my dataset is a time series of coordinates of an animal, and I want to measure the home range size of this animal using a bivariate normal kernel. I'm using the ks package because it can estimate the bandwidth of a kernel distribution with methods such as plug-in and smoothed cross-validation.
Any help would be really appreciated!
Here are two ways to do it. They are both fairly complex conceptually, but actually very simple in code.
# Extract the 95% contour
fhat <- kde(x=unicef, H=H.scv,compute.cont=TRUE)
contour.95 <- with(fhat,contourLines(x=eval.points[[1]],y=eval.points[[2]],
z=estimate,levels=cont["95%"])[[1]])
# Method 1: polygon area with pracma
library(pracma)
with(contour.95,polyarea(x,y))
# [1] -113.677
# Method 2: polygon area with sp and rgeos
library(sp)
library(rgeos)
poly <- with(contour.95,data.frame(x,y))
poly <- rbind(poly,poly[1,]) # polygon needs to be closed...
spPoly <- SpatialPolygons(list(Polygons(list(Polygon(poly)),ID=1)))
gArea(spPoly)
# [1] 113.677
Explanation
First, the kde(...) function returns a kde object, which is a list with 9 elements. You can read about this in the documentation, or you can type str(fhat) at the command line, or, if you're using RStudio (highly recommended), you can see this by expanding the fhat object in the Environment tab.
One of the elements is $eval.points, the points at which the kernel density estimates are evaluated. The default is to evaluate at 151 equally spaced points. $eval.points is itself a list of, in your case, 2 vectors. So, fhat$eval.points[[1]] represents the points along "Under-5" and fhat$eval.points[[2]] represents the points along "Ave life exp".
Another element is $estimate, which has the z-values for the kernel density, evaluated at every combination of x and y. So $estimate is a 151 X 151 matrix.
If you call kde(...) with compute.cont=TRUE, you get an additional element in the result: $cont, which contains the z-value in $estimate corresponding to every percentile from 1% to 99%.
So, you need to extract the x- and y-values corresponding to the 95% contour, and use that to calculate the area. You would do that as follows:
fhat <- kde(x=unicef, H=H.scv,compute.cont=TRUE)
contour.95 <- with(fhat,contourLines(x=eval.points[[1]],y=eval.points[[2]],
z=estimate,levels=cont["95%"])[[1]])
Now, contour.95 has the x- and y-values corresponding to the 95% contour of fhat. There are (at least) two ways to get the area. One uses the pracma package and calculates it directly.
library(pracma)
with(contour.95,polyarea(x,y))
# [1] -113.677
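The negative value comes from the ordering of x and y: polyarea(...) returns a signed area, and the direction in which this contour is traversed gives it a negative sign. The magnitude, 113.677, is the area, so wrap the call in abs(...) if you need a positive number.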
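To see where the sign comes from, here is a small base R sketch of the signed polygon area (the shoelace formula). The function name shoelace_area() is made up for illustration, and this is not pracma's source code, only the same kind of calculation.

```r
# Signed polygon area via the shoelace formula: positive when the vertices
# run counterclockwise, negative when they run clockwise.
shoelace_area <- function(x, y) {
  n <- length(x)
  nxt <- c(2:n, 1)  # index of the next vertex, wrapping around at the end
  sum(x * y[nxt] - x[nxt] * y) / 2
}

# Unit square, listed counterclockwise and then clockwise
ccw <- shoelace_area(c(0, 1, 1, 0), c(0, 0, 1, 1))  # 1
cw  <- shoelace_area(c(0, 0, 1, 1), c(0, 1, 1, 0))  # -1
```

Reversing the vertex order flips the sign but not the magnitude, which is why the two methods agree on 113.677.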
An alternative uses the area calculation routines in rgeos (a GIS package). Unfortunately, this requires you to first turn your coordinates into a "SpatialPolygon" object, which is a bit of a bear. Nevertheless, it is also straightforward.
library(sp)
library(rgeos)
poly <- with(contour.95,data.frame(x,y))
poly <- rbind(poly,poly[1,]) # polygon needs to be closed...
spPoly <- SpatialPolygons(list(Polygons(list(Polygon(poly)),ID=1)))
gArea(spPoly)
# [1] 113.677
Another method would be to use the contourSizes() function within the ks package. I've also been interested in using this package to compare both 2D and 3D space use in ecology, but I wasn't sure how to extract the 2D density estimates. I tested this method by estimating the area of an "animal" which was limited to the area of a circle with a known radius. Below is the code:
set.seed(123)
# need GEOmap for the inpoly function
require(GEOmap)
# need ks for Hpi(), kde() and contourSizes()
require(ks)
# Create a data frame centered at coordinates 0,0
data = data.frame(x=0,y=0)
# Create a vector of radians from 0 to 2*pi for making a circle to
# test the area
circle = seq(0,2*pi,length=100)
# Select a radius for your circle
radius = 10
# Create a buffer for when you simulate points (this will be more clear below)
buffer = radius+2
# Simulate x and y coordinates from uniform distribution and combine
# values into a dataframe
createPointsX = runif(1000,min = data$x-buffer, max = data$x+buffer)
createPointsY = runif(1000,min = data$y-buffer, max = data$y+buffer)
data1 = data.frame(x=createPointsX,y=createPointsY)
# Plot the raw data
plot(data1$x,data1$y)
# Calculate the coordinates used to create a circle with center 0,0 and
# with radius specified above
coords = as.data.frame(t(rbind(data$x+sin(circle)*radius,
data$y+cos(circle)*radius)))
names(coords) = c("x","y")
# Add circle to plot with red line
lines(coords$x,coords$y,col=2,lwd=2)
# Use the inpoly function to calculate whether points lie within
# the circle or not.
inp = inpoly(data1$x, data1$y, coords)
data1 = data1[inp == 1,]
# Finally add points that lie within the circle as blue filled dots
points(data1$x,data1$y,pch=19,col="blue")
# Area of the circle (known value)
pi * radius^2
#[1] 314.1593
# Sub in your own data here to calculate 95% homerange or 50% core area usage
H.pi = Hpi(data1,binned=T)
fhat = kde(data1,H=H.pi)
ct1 = contourSizes(fhat, cont = 95, approx=TRUE)
# Compare the known area of the circle to the 95% contour size
ct1
# 5%
# 291.466
I've also tried creating 2 un-connected circles and testing the contourSizes() function and it seems to work really well on disjointed distributions.
I have occurrence points for a species, and I'd like to remove potential sampling bias (where some regions might have a much greater density of points than others). One way to do this would be to find the largest subset of points in which no two points are closer than a certain distance X to each other. Essentially, I would prevent points from being too close to each other.
Are there any existing R functions to do this? I've searched through various spatial packages, but haven't found anything, and can't figure out exactly how to implement this myself.
An example occurrence point dataset can be downloaded here.
Thanks!
I've written a new version of this function that no longer really follows rMaternII.
The input can either be a SpatialPoints, SpatialPointsDataFrame or matrix object.
Seems to work well, but suggestions welcome!
filterByProximity <- function(xy, dist, mapUnits = FALSE) {
  # xy can be either a SpatialPoints or SPDF object, or a matrix
  # dist is in km if mapUnits = FALSE, in map units otherwise
  # spDists() comes from the sp package
  d <- spDists(xy, longlat = !mapUnits)
  diag(d) <- NA
  close <- (d <= dist)
  # each row of closePts is a pair of points that are too close together
  closePts <- which(close, arr.ind = TRUE)
  if (nrow(closePts) == 0) {
    return(xy)
  }
  discard <- integer(0)
  while (nrow(closePts) > 0) {
    # discard the first point of the first remaining pair, then drop
    # every pair that involves that point
    i <- closePts[1, 1]
    discard <- c(discard, i)
    closePts <- closePts[closePts[, 1] != i & closePts[, 2] != i, , drop = FALSE]
  }
  return(xy[-discard, ])
}
Let's test it:
require(rgeos)
require(sp)
pts <- readWKT("MULTIPOINT ((3.5 2), (1 1), (2 2), (4.5 3), (4.5 4.5), (5 5), (1 5))")
pts2 <- filterByProximity(pts,dist=2, mapUnits=T)
plot(pts)
axis(1)
axis(2)
apply(as.data.frame(pts),1,function(x) plot(gBuffer(SpatialPoints(coords=matrix(c(x[1],x[2]),nrow=1)),width=2),add=T))
plot(pts2,add=T,col='blue',pch=20,cex=2)
There is also an R package called spThin that performs spatial thinning on point data. It was developed to reduce the effects of sampling bias in species distribution models, and it runs multiple iterations to optimize the result. The function is quite easy to use; the vignette can be found here. There is also a paper in Ecography with details about the technique.
Following Josh O'Brien's advice, I looked at spatstat's rMaternI function, and came up with the following. It seems to work pretty well.
The distance is in map units. It would be nice to incorporate one of R's distance functions that always returns distances in meters, rather than input units, but I couldn't figure that out...
require(spatstat)
require(maptools)
occ <- readShapeSpatial('occurrence_example.shp')
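filterByProximity <- function(occ, dist) {
  pts <- as.ppp.SpatialPoints(occ)
  # distance from each point to its nearest neighbour
  d <- nndist(pts)
  # keep only points whose nearest neighbour is farther away than dist
  z <- which(d > dist)
  return(occ[z,])
}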
occ2 <- filterByProximity(occ,dist=0.2)
plot(occ)
plot(occ2,add=T,col='blue',pch=20)
Rather than removing data points, you might consider spatial declustering. This involves giving points in clusters a lower weight than outlying points. The two simplest ways to do this involve a polygonal segmentation, like a Voronoi diagram, or some arbitrary grid. Both methods will weight points in each region according to the area of the region.
For example, if we take the points in your test (1,1),(2,2),(4.5,4.5),(5,5),(1,5) and apply a regular 2-by-2 mesh, where each cell is three units on a side, then the five points fall into three cells. The points ((1,1),(2,2)) falling into the cell [0,3]X[0,3] would each have weights 1/( no. of points in current cell TIMES tot. no. of occupied cells ) = 1 / ( 2 * 3 ). The same thing goes for the points ((4.5,4.5),(5,5)) in the cell (3,6]X(3,6]. The "outlier", (1,5) would have a weight 1 / ( 1 * 3 ). The nice thing about this technique is that it is a quick way to generate a density based weighting scheme.
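The grid weights from this example can be reproduced with a few lines of base R. This is a sketch under assumptions: cell_decluster_weights() and its arguments are hypothetical names, the mesh starts at the origin, and a coordinate exactly on the lower edge is placed in the first cell.

```r
# Hypothetical helper: cell-declustering weights for the points in xy
# (a two-column matrix) on a square mesh with cells of side cell_size.
cell_decluster_weights <- function(xy, cell_size, origin = c(0, 0)) {
  # Cell indices along x and y; cells are (lower, upper] except the first
  col_id <- pmax(ceiling((xy[, 1] - origin[1]) / cell_size), 1)
  row_id <- pmax(ceiling((xy[, 2] - origin[2]) / cell_size), 1)
  cell <- paste(col_id, row_id, sep = "_")
  n_in_cell <- as.vector(table(cell)[cell])  # points sharing each point's cell
  n_occupied <- length(unique(cell))         # number of non-empty cells
  1 / (n_in_cell * n_occupied)
}

pts <- cbind(x = c(1, 2, 4.5, 5, 1), y = c(1, 2, 4.5, 5, 5))
w <- cell_decluster_weights(pts, cell_size = 3)
w
```

The first four points get 1/6 each and the outlier (1,5) gets 1/3, so the weights sum to 1.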
A polygonal segmentation involves drawing a polygon around each point and using the area of that polygon to calculate the weight. Generally, the polygons completely cover the entire region, and each weight is proportional to the area of its polygon, so tightly clustered points, which have small polygons, receive low weights. A Voronoi diagram is usually used for this, but polygonal segmentations may be calculated using other techniques, or may be specified by hand.