Shortest Euclidean distance between two groups in subset - r

I have a largish data frame (50000 points) representing points in 2D collected from biological images. Points are categorised as either red or green and are associated with each other in groups (in the example: cells A-D). A small test data set (MSR_test.csv) can be found here.
require(ggplot2)
cells <- read.csv("MSR_test.csv")
ggplot(cells, aes(X, Y, colour = channel, shape = cell)) +
  geom_point() +
  scale_colour_manual(values = c("green", "red"), name = "channel")
I am trying to find a reasonably straightforward way (perhaps involving plyr?) to find the Euclidean distance between each green point and its nearest red point within the same ‘cell group’. Whilst I think I have worked out how to do this for an individual group (using rdist from the fields package), I can't seem to work out how to apply that method across all the groups in my data frame.

I don't see any reason to use plyr, but maybe I'm wrong.
The following code works on your example. I did not use any heavy function to compute the Euclidean distance, mainly because you may have to compute it on a lot of points.
green <- subset(cells, channel == "Green")
red <- subset(cells, channel == "Red")
## squared Euclidean distance from point a to every row of M
## (the square root is unnecessary when we only need which.min)
fun_dist <- function(a, M) rowSums((M - matrix(1, nrow(M), 1) %*% as.numeric(a))^2)
foo <- function(greenrow, matred) {
  subred <- subset(matred, cell == greenrow["cell"], select = c("X", "Y"))
  minred <- subred[which.min(fun_dist(unlist(greenrow[c("X", "Y")]), subred)), ]
  return(minred)
}
do.call("rbind", apply(green, 1, foo, red))


In ggplot, how to draw a circle/disk with a line that divides its area according to a given ratio and colored points inside?

I want to visualize proportions using points inside a circle. For example, let's say that I have 100 points that I wish to scatter (somewhat randomly jittered) in a circle.
Next, I want to use this diagram to represent the proportions of people who voted for Biden/Harris in the 2020 US presidential election, in each state.
Example #1 -- Michigan
Biden got 50.62% of Michigan's votes. I'm going to draw a horizontal diameter that splits the circle into two halves, and then color the points under the diameter in blue (the Democrats' color).
Example #2 -- Wyoming
Unlike Michigan, in Wyoming Biden got only 26.55% of the votes, approximately a quarter of the vote. In this case I'd draw a horizontal chord that divides the circle such that the disk's area under the chord is 25% of the entire disk area, and then color the respective points in that area in blue. Since I have 100 points in total, 25 points represent the 25% who voted for Biden in Wyoming.
My question: How can I do this with ggplot? I researched this issue, and there's a lot of geometry going on here. First, the kind of area I'm talking about is called a "circular segment". Second, there are many formulas to calculate its area, if we know some other parameters about the shape (such as the radius length, etc.). See this nice demo.
However, my goal isn't to solve geometry problems, but just to represent proportions in a very specific way:
draw a circle
sprinkle X number of points inside
draw a (real or invisible) horizontal line that divides the circle/disk area according to a given proportion
ensure that the points are arranged according to the split. That is, if we want to represent a 30%-70% split, then 30% of the points should sit under the line that divides the disk.
color the points under the line.
I understand that this is a somewhat exotic visualization, but I'll be thankful for any help with it.
EDIT
I've found a reference to a JavaScript package that does something very similar to what I'm asking.
I took a crack at this for fun. There's a lot more that could be done. I agree that this is not a great way to visualize proportions, but if it's engaging your audience ...
Formulas for determining appropriate heights are taken from Wikipedia. In particular we need the formulas
a/A = (theta - sin(theta))/(2*pi)
h = 1-cos(theta/2)
where a is the area of the segment; A is the whole area of the circle; theta is the angle subtended by the arc that defines the segment (see Wikipedia for pictures); and h is the height of the segment (for a unit circle; in general h = r*(1 - cos(theta/2))).
Machinery for finding heights.
afun <- function(x) (x-sin(x))/(2*pi)
## curve(afun, from=0, to = 2*pi)
find_a <- function(a) {
  uniroot(
    function(x) afun(x) - a,
    interval = c(0, 2*pi))$root
}
find_h <- function(a) {
  1 - cos(find_a(a)/2)
}
vfind_h <- Vectorize(find_h)
## find_a(0.5)
## find_h(0.5)
## curve(vfind_h(x), from = 0, to= 1)
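As a quick sanity check (my own addition, for a unit circle): a proportion of 0.5 should give theta = pi and a segment height equal to the radius, i.e. the dividing chord is the horizontal diameter.
## sanity check on the helpers (uniroot's default tolerance is ~1e-4)
stopifnot(abs(find_a(0.5) - pi) < 1e-3,
          abs(find_h(0.5) - 1) < 1e-3)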
set up a circle
dd <- data.frame(x=0,y=0,r=1)
library(ggforce)
library(ggplot2); theme_set(theme_void())
gg0 <- ggplot(dd) + geom_circle(aes(x0=x,y0=y,r=r)) + coord_fixed()
finish
props <- c(0.2,0.5,0.3) ## proportions
n <- 100 ## number of points to scatter
cprop <- cumsum(props)[-length(props)]
h <- vfind_h(cprop)
set.seed(101)
r <- runif(n)
th <- runif(n, 0, 2 * pi)
## scatter the points uniformly over the unit disk (sqrt(r) gives uniform density)
dd2 <- data.frame(x = sqrt(r) * cos(th),
                  y = sqrt(r) * sin(th))
dd2$g <- cut(dd2$y, c(1, 1 - h, -1))
gg0 + geom_point(data=dd2, aes(x, y, colour = g), size=3)
There are a bunch of tweaks that would make this better (meaningful names for the categories; reversing the axis order to match the plot; maybe adding segments delimiting the sections, or, with more work, polygons so you can shade the sections).
You should definitely check this for mistakes — e.g. there are places where I may have used a set of values where I should have used their first differences, or vice versa (values vs cumulative sum). But this should get you started.

Shrink convex hull

I have a bunch of points in 2D space and have calculated a convex hull for them. I would now like to "tighten" the hull so that it no longer necessarily encompasses all points. In the typical nails-in-board-with-rubber-band analogy, what I'd like to achieve is to be able to tune the elasticity of the rubber band and allow nails to bend at pressure above some limit. That's just an analogy; there is no real physics here. This would kind of be related to the reduction in hull area if a given point was removed, but not quite, because there could be two points that are very close to each other. This is not necessarily related to outlier detection, because you could imagine a pattern where a large fraction of the nails would bend if they are on a narrow line (imagine a hammer shape, for example). All of this has to be reasonably fast for thousands of points. Any hints where I should look in terms of algorithms? An implementation in R would be perfect, but not needed.
EDIT AFTER COMMENT: The three points I've labelled are those with the largest potential for reducing the hull area if they are excluded. In the plot there is no other set of three points that would result in a larger area reduction. A naive implementation of what I'm looking for would be to randomly sample some fraction of the points, calculate the hull area, remove each point on the hull iteratively, recalculate the area, repeat many times, and remove points that tend to lead to a large area reduction. Maybe this could be implemented in some random forest variant? It's not quite right though, because I would like the points to be removed one by one so that you get the following result. If you looked at all points in one go, it would possibly be best to trim from the edges of the "hammer head".
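To make that concrete, here is a rough, untested sketch of that resampling idea (my own illustration, not from any of the answers below; it borrows sp::Polygon for the hull area, as the answers do, and pts is a stand-in for your point matrix):
library(sp)
## score each point by how much its removal tends to shrink the hull of a
## random subsample; high scores mark candidates for "bending"
pts <- cbind(runif(200), runif(200))   # stand-in for the real data
hull_area <- function(xy) Polygon(xy[chull(xy[, 1], xy[, 2]), ], hole = FALSE)@area
score <- numeric(nrow(pts))
for (rep in 1:200) {
  idx <- sample(nrow(pts), floor(0.8 * nrow(pts)))
  sub <- pts[idx, ]
  a0 <- hull_area(sub)
  for (j in chull(sub[, 1], sub[, 2])) {
    score[idx[j]] <- score[idx[j]] + (a0 - hull_area(sub[-j, ]))
  }
}
head(order(score, decreasing = TRUE))   # indices of the most hull-inflating points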
Suppose I have a set of points like this:
set.seed(69)
x <- runif(20)
y <- runif(20)
plot(x, y)
Then it is easy to find the subset points that sit on the convex hull by doing:
ss <- chull(x, y)
This means we can plot the convex hull by doing:
lines(x[c(ss, ss[1])], y[c(ss, ss[1])], col = "red")
Now we can randomly remove one of the points that sits on the convex hull (i.e. "bend a nail") by doing:
bend <- sample(ss, 1)
x <- x[-bend]
y <- y[-bend]
And we can then repeat the process of finding the convex hull of this new set of points:
ss <- chull(x, y)
lines(x[c(ss, ss[1])], y[c(ss, ss[1])], col = "blue", lty = 2)
To get the point which will, on removal, cause the greatest reduction in area, one option would be the following function:
library(sp)
shrink <- function(coords)
{
  ss <- chull(coords[, 1], coords[, 2])
  ## drop the hull point whose exclusion leaves the smallest polygon area,
  ## i.e. the point whose removal reduces the area the most
  outlier <- ss[which.min(sapply(seq_along(ss),
                                 function(i) Polygon(coords[ss[-i], ], hole = FALSE)@area))]
  coords[-outlier, ]
}
So you could do something like:
coords <- cbind(x, y)
new_coords <- shrink(coords)
new_chull <- new_coords[chull(new_coords[, 1], new_coords[, 2]),]
new_chull <- rbind(new_chull, new_chull[1,])
plot(x, y)
lines(new_chull[,1], new_chull[, 2], col = "red")
Of course, you could do this in a loop so that new_coords is fed back into shrink multiple times.
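For example (a small sketch reusing coords from above), peeling off the three most area-reducing points one at a time:
new_coords <- coords
for (i in 1:3) new_coords <- shrink(new_coords)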
Calculate a robust center and covariance matrix using cov.mcd from MASS, and then the Mahalanobis distance of each point from that center (mahalanobis is in base R's stats package). We then show a quantile plot of the Mahalanobis distances using PlotMD from modi, and also show the associated outliers in red in the second plot. (There are other functions in modi that may be of interest as well.)
library(MASS)
library(modi)
set.seed(69)
x <- runif(20)
y <- runif(20)
m <- cbind(x, y)
mcd <- cov.mcd(m)
md <- mahalanobis(m, mcd$center, mcd$cov)
stats <- PlotMD(md, 2, alpha = 0.90)
giving the quantile plot of the Mahalanobis distances (screenshot not reproduced here),
and we show the convex hull using lines and the outliers in red:
plot(m)
ix <- chull(m)
lines(m[c(ix, ix[1]), ])
wx <- which(md > stats$halpha)
points(m[wx, ], col = "red", pch = 20)
Thank you both! I've tried various methods for outlier detection, but they're not quite what I was looking for; they have worked badly due to the weird shapes of my clusters. I know I talked about convex hull area, but I think filtering on segment lengths yields better results and is closer to what I really wanted. Then it would look something like this:
shrink <- function(xy, max_length = 30){
  to_keep <- 1:(dim(xy)[1])
  centroid <- c(mean(xy[, 1]), mean(xy[, 2]))
  while (TRUE){
    ss <- chull(xy[, 1], xy[, 2])
    ss <- c(ss, ss[1])
    # squared segment lengths (no sqrt), so max_length is compared on the squared scale
    lengths <- sapply(1:(length(ss) - 1), function(i) sum((xy[ss[i + 1], ] - xy[ss[i], ])^2))
    # This gets the point with the longest convex hull segment. chull returns points
    # in clockwise order, so the point to remove is either this one or the one
    # after it. Remove the one furthest from the centroid.
    max_point <- which.max(lengths)
    if (lengths[max_point] < max_length) return(to_keep)
    if (sum((xy[ss[max_point], ] - centroid)^2) > sum((xy[ss[max_point + 1], ] - centroid)^2)){
      xy <- xy[-ss[max_point], ]
      to_keep <- to_keep[-ss[max_point]]
    } else {
      xy <- xy[-ss[max_point + 1], ]
      to_keep <- to_keep[-ss[max_point + 1]]
    }
  }
}
It's not optimal because it factors in the distance to the centroid, which I would have liked to avoid, and there is a max_length parameter that should be calculated from the data instead of being hard-coded.
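One possibility for the latter (a hypothetical helper of mine, not part of the original post): derive max_length from the data, e.g. as a multiple of an upper quantile of the squared hull-segment lengths, so the threshold lives on the same squared scale the function compares against.
suggest_max_length <- function(xy, mult = 4) {
  ss <- chull(xy[, 1], xy[, 2])
  ss <- c(ss, ss[1])
  sq_len <- sapply(1:(length(ss) - 1), function(i) sum((xy[ss[i + 1], ] - xy[ss[i], ])^2))
  mult * quantile(sq_len, 0.75)   # tune 'mult' to taste
}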
No filter:
It looks like this because there are 500 000 cells in here, and there are many that end up "wrong" when projecting from ~20 000 dimensions to 2.
Filter:
Note that it filters out points at tips of some clusters. This is less-than-optimal but ok. The overlap between some clusters is true and should be there.

R, SOM, Kohonen Package, Outlier Detection

With SOM I experimented a little. First I used MiniSOM in Python, but I was not impressed and changed to the kohonen package in R, which offers more features than the former. Basically, I applied SOM to three use cases: (1) clustering in 2D with generated data, (2) clustering of higher-dimensional data (the built-in wine data set), and (3) outlier detection. I solved all three use cases, but I would like to raise a question in connection with the outlier detection I applied. For this purpose I used the vector som$distances, which contains a distance for each row of the input data set. Values with exceptionally large distances can be outliers. However, I do not know how this distance is computed. The package documentation (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) describes this metric only as "distance to the closest unit".
Could you please tell how this distance is computed?
Could you please comment on the outlier detection I used? How would you have done it? (In the generated data set it really does find the outliers. In the real wine data set there are four values that clearly stand out among the 177 wines; see the charts below. I rather like the idea, which crossed my mind, of using bar charts to depict this.)
Charts:
Generated data: 100 points in 2D in 5 distinct clusters plus 2 outliers (category 6 marks the outliers).
Distances shown for all 102 data points; the last two are the outliers, which were correctly identified. I repeated the test with 500 and 1000 data points, again adding only 2 outliers, and the outliers were found in those cases as well.
Distances for the real wine data set with potential outliers:
The row id of the potential outliers:
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
it produces the following output:
index value
59 59 12.22916
110 110 13.41211
121 121 15.86576
158 158 11.50079
My annotated code snippet:
library(kohonen)
library(ggplot2)
data(wines)
scaled_wines <- scale(wines)
# creating and training SOM
som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))
summary(som.wines)
#looking for outliers, dist = distance to the closest unit
som.wines$distances
len <- length(som.wines$distances)
index_in_vector <- c(1:len)
df_wine<-data.frame(cbind(index_in_vector, som.wines$distances))
colnames(df_wine) <-c("index", "value")
po <-ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity")
po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som.wines$distances") + xlab("Number of Rows in the Data Set")
plot(po)
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
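A possible data-driven alternative to that hard-coded threshold (my suggestion, not from the original post) is a simple boxplot-style rule on the distances themselves:
## flag distances beyond Q3 + 3*IQR of som.wines$distances
cut_off <- quantile(df_wine$value, 0.75) + 3 * IQR(df_wine$value)
df_wine[df_wine$value > cut_off, ]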
Further Code Samples
With regard to the discussion in the comments, I am also posting the code snippets asked for. As far as I remember, I constructed the code lines responsible for the clustering based on samples in the kohonen package documentation (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf), but I am not completely sure; it was more than a year ago. The code is provided as is, without any warranty :-). Please bear in mind that a particular clustering approach may perform with different accuracy on different data. I would also recommend comparing it with t-SNE on the wine data set (data(wines), available in R). Moreover, consider implementing heat-maps to show how the data are distributed with respect to the individual variables. (For the 2-variable example below this is not important, but it would be nice for the wine data set; see the sketch further down.)
Data Generation with Five Clusters and 2 Outliers and Plotting
library(stats)
library(ggplot2)
library(kohonen)
generate_data <- function(num_of_points, num_of_clusters, outliers=TRUE){
  num_of_points_per_cluster <- num_of_points/num_of_clusters
  cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n", num_of_points_per_cluster, num_of_clusters))
  arr <- array()
  standard_dev_y <- 6000
  standard_dev_x <- 2
  # for reproducibility, set the random generator
  set.seed(10)
  for (i in 1:num_of_clusters){
    centroid_y <- runif(1, min=10000, max=200000)
    centroid_x <- runif(1, min=20, max=70)
    cat(sprintf("centroid_x = %s \n, centroid_y = %s", centroid_x, centroid_y))
    vector_y <- rnorm(num_of_points_per_cluster, mean=centroid_y, sd=standard_dev_y)
    vector_x <- rnorm(num_of_points_per_cluster, mean=centroid_x, sd=standard_dev_x)
    cluster <- array(c(vector_y, vector_x), dim=c(num_of_points_per_cluster, 2))
    cluster <- cbind(cluster, i)
    arr <- rbind(arr, cluster)
  }
  if(outliers){
    # adding two outliers
    arr <- rbind(arr, c(10000, 30, 6))
    arr <- rbind(arr, c(150000, 70, 6))
  }
  colnames(arr) <- c("y", "x", "Cluster")
  # workaround to remove the first NA row created by array()
  arr <- na.omit(arr)
  return(arr)
}
scatter_plot_data <- function(data_in, couloring_base_indx, main_label){
  df <- data.frame(data_in)
  colnames(df) <- c("y", "x", "Cluster")
  pl <- ggplot(data=df, aes(x=x, y=y)) + geom_point(aes(color=factor(df[, couloring_base_indx])))
  pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
  print(pl)
}
##################
# generating data
data <- generate_data(100, 5, TRUE)
print(data)
scatter_plot_data(data, couloring_base_indx<-3, "Original Clusters without Outliers \n 102 Points")
Preparation, Clustering and Plotting
I used the hierarchical clustering approach with the Kohonen Map (SOM).
normalising_data <- function(data){
  # normalizing data points, not the cluster identifiers
  mtrx <- data.matrix(data)
  umtrx <- scale(mtrx[,1:2])
  umtrx <- cbind(umtrx, factor(mtrx[,3]))
  colnames(umtrx) <- c("y", "x", "Cluster")
  return(umtrx)
}
train_som <- function(umtrx){
  # unsupervised learning
  set.seed(7)
  g <- somgrid(xdim=5, ydim=5, topo="hexagonal")
  #map <- som(umtrx[, 1:2], grid=g, alpha=c(0.005, 0.01), radius=1, rlen=1000)
  map <- som(umtrx[, 1:2], grid=g)
  summary(map)
  return(map)
}
plot_som_data <- function(map){
  par(mfrow=c(3,2))
  # plot some characteristics of the SOM map
  plot(map, type='changes')
  plot(map, type='codes', main="Mapping Data")
  plot(map, type='count')
  plot(map, type='mapping') # how many data points are held by each neuron
  plot(map, type='dist.neighbours') # the darker the colour, the closer the points; the lighter the colour, the more distant the points
  # switch the plot configuration back to normal
  par(mfrow=c(1,1))
}
plot_disstances_to_the_closest_point <- function(map){
  # to see which neuron is assigned to which value
  map$unit.classif
  # looking for outliers, dist = distance to the closest unit
  map$distances
  max(map$distances)
  len <- length(map$distances)
  index_in_vector <- c(1:len)
  df <- data.frame(cbind(index_in_vector, map$distances))
  colnames(df) <- c("index", "value")
  po <- ggplot(df, aes(index, value)) + geom_bar(stat = "identity")
  po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som$distances") + xlab("Number of Rows in the Data Set")
  plot(po)
  return(df)
}
###################
# unsupervised learning
umtrx <- normalising_data(data)
map<-train_som(umtrx)
plot_som_data(map)
#####################
# creating the dendogram and then the clusters for the neurons
dendogram <- hclust(object.distances(map, "codes"), method = 'ward.D')
plot(dendogram)
clusters <- cutree(dendogram, 7)
clusters
length(clusters)
#visualising the clusters on the map
par(mfrow = c(1,1))
plot(map, type='dist.neighbours', main="Mapping Data")
add.cluster.boundaries(map, clusters)
Plots with the Clusters
You can also create nice heat-maps for selected variables, but I have not implemented them; for clustering with only 2 variables it does not really make much sense. If you implement it for the wine data set, please add the code and the charts to this post.
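For what it's worth, a minimal sketch of such a heat-map (component plane) for the som.wines model fitted in the first snippet; this is my own illustration, not part of the original post, and it assumes kohonen >= 3.0, where $codes is a list. Note the codebook values are on the scaled, not the original, scale.
plot(som.wines, type = "property",
     property = som.wines$codes[[1]][, "alcohol"],
     main = "alcohol (scaled)")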
#see the predicted clusters with the data set
# 1. add the vector of the neuron ids to the data
mapped_neurons <- map$unit.classif
umtrx <- cbind(umtrx, mapped_neurons)
# 2. taking the predicted clusters and adding them to the original matrix
# very good description of the apply functions:
# https://www.guru99.com/r-apply-sapply-tapply.html
get_cluster_for_the_row <- function(x, cltrs){
  return(cltrs[x])
}
predicted_clusters <- sapply (umtrx[,4], get_cluster_for_the_row, cltrs<-clusters)
mtrx <- data.matrix(data) # rebuild the unscaled matrix here; the earlier mtrx was local to normalising_data()
mtrx <- cbind(mtrx, predicted_clusters)
scatter_plot_data(mtrx, couloring_base_indx<-4, "Predicted Clusters with Outliers \n 100 points")
See the predicted clusters below in case there were outliers and in case there were not.
I am not quite sure, but I often find that the distance between two objects in a given dimensional space is usually measured as the Euclidean distance. For example, two points A and B in a two-dimensional space, located at A(x=3, y=4) and B(x=6, y=8), are 5 distance units apart, the result of sqrt((3-6)^2 + (4-8)^2). The same applies to data of higher dimension, by adding a further squared difference of the two points' values for each extra dimension: if A(x=3, y=4, z=5) and B(x=6, y=8, z=7), then the distance is sqrt((3-6)^2 + (4-8)^2 + (5-7)^2), and so on.
In kohonen, I think that after the model has finished the training phase, the algorithm calculates the distance of each datum to all nodes and then assigns it to the nearest node (the node with the smallest distance to it). Eventually, the values inside the 'distances' variable returned by the model are the distances of every datum to its nearest node. One thing to note from your script is that the algorithm does not measure the distance directly on the original property values of the data, because they were scaled prior to feeding the data to the model; the distance measurement is applied to the scaled version of the data. Scaling is a standard procedure to eliminate the dominance of one variable over another.
I believe that your method is acceptable, because the values inside the 'distances' variable are the distances of each datum to its nearest node. So if the distance between a datum and its nearest node is large, then its distances to all the other nodes are obviously even larger.
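If you want to convince yourself of what $distances holds, a rough sketch (an untested assumption of mine about kohonen's internals, not something the documentation confirms) is to recompute the distance of each scaled wine to the codebook vector of its winning unit and compare:
codes <- som.wines$codes[[1]]
manual <- rowSums((scaled_wines - codes[som.wines$unit.classif, ])^2)
plot(manual, som.wines$distances)  # should be (roughly) proportional if the guess is right
Depending on the package version, $distances may be a (possibly weighted or rescaled) sum of squared differences rather than its square root, so compare the two vectors rather than expecting exact equality.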

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates so that it best fits into another, larger, set of 2D coordinates. Imagine, for example, having a Polaroid photo of a small piece of the starry sky with you out at night, and wanting to hold it up in a position where its stars line up with the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# take a subset of the coordinates to serve as the points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should be very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above yields the figure discussed below (not reproduced here).
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data has around 1000 reference points and up to a few hundred points to search (I could use fewer if that is more efficient).
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find a position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history), using optim() to minimize the sum of distances between points, will only work in the limited circumstance where the point distribution used as reference data is in the middle of the point field. A solution that satisfies the question, and still seems workable for a few thousand points, is a brute-force delta-and-comparison algorithm: align the first sample point with each reference point in turn, and count how many of the remaining sample points then have a reference point within a minimum distance threshold (the threshold is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for(i in 1:nrow(ref_coords)) {
  x2 <- my_coords_start[,1]
  y2 <- my_coords_start[,2]
  x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
  y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])
  ## Calculate all pairwise distances between reference and field data:
  dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")
  ## Only take distances for the sampled data:
  dists <- as.matrix(dists)[-1*1:length(x1),]
  ## Calculate the number of distances within the minimum
  ## distance threshold minus the diagonal portion:
  win_thresh <- sum(rowSums(dists < min_dist) > 1)
  ## If we have more "matches" than our best then calculate a new
  ## dx and dy:
  if (win_thresh > win_thresh_old) {
    win_thresh_old <- win_thresh
    dx <- (x2[1] - ref_coords[i,1])
    dy <- (y2[1] - ref_coords[i,2])
  }
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if anyone can solve this more efficiently for the number of points in the test data, possibly using a statistical or optimization algorithm.
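One direction for a faster search (a sketch of my own, not from the answer above; it assumes the FNN package and reuses ref_coords and my_coords_start from the question) is to score each candidate shift with a nearest-neighbour lookup instead of a full distance matrix:
library(FNN)
tol <- 0.5
score_shift <- function(delta) {
  shifted <- sweep(my_coords_start, 2, -delta)   # add delta to every sample point
  sum(get.knnx(as.matrix(ref_coords), shifted, k = 1)$nn.dist < tol)
}
## candidate shifts: align sample point 1 with each reference point in turn
cands <- t(apply(ref_coords, 1, function(p) p - my_coords_start[1, ]))
best <- cands[which.max(apply(cands, 1, score_shift)), ]   # estimated c(dx, dy)
points(my_coords_start[, 1] + best[1], my_coords_start[, 2] + best[2],
       cex = 1.5, col = "purple")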

spatial filtering by proximity in R

I have occurrence points for a species, and I'd like to remove potential sampling bias (where some regions might have a much greater density of points than others). One way to do this would be to select a maximal subset of points that are no less than a certain distance X from each other. Essentially, I would prevent points from being too close to each other.
Are there any existing R functions to do this? I've searched through various spatial packages, but haven't found anything, and can't figure out exactly how to implement this myself.
An example occurrence point dataset can be downloaded here.
Thanks!
I've written a new version of this function that no longer really follows rMaternII.
The input can either be a SpatialPoints, SpatialPointsDataFrame or matrix object.
Seems to work well, but suggestions welcome!
filterByProximity <- function(xy, dist, mapUnits = F) {
  # xy can be either a SpatialPoints or SPDF object, or a matrix
  # dist is in km if mapUnits=F, in mapUnits otherwise
  if (!mapUnits) {
    d <- spDists(xy, longlat = T)
  }
  if (mapUnits) {
    d <- spDists(xy, longlat = F)
  }
  diag(d) <- NA
  close <- (d <= dist)
  diag(close) <- NA
  closePts <- which(close, arr.ind = T)
  discard <- matrix(nrow = 2, ncol = 2)
  if (nrow(closePts) > 0) {
    while (nrow(closePts) > 0) {
      if ((!paste(closePts[1,1],closePts[1,2],sep='_') %in% paste(discard[,1],discard[,2],sep='_')) & (!paste(closePts[1,2],closePts[1,1],sep='_') %in% paste(discard[,1],discard[,2],sep='_'))) {
        discard <- rbind(discard, closePts[1,])
        closePts <- closePts[-union(which(closePts[,1] == closePts[1,1]), which(closePts[,2] == closePts[1,1])),]
      }
    }
    discard <- discard[complete.cases(discard),]
    return(xy[-discard[,1],])
  }
  if (nrow(closePts) == 0) {
    return(xy)
  }
}
Let's test it:
require(rgeos)
require(sp)
pts <- readWKT("MULTIPOINT ((3.5 2), (1 1), (2 2), (4.5 3), (4.5 4.5), (5 5), (1 5))")
pts2 <- filterByProximity(pts,dist=2, mapUnits=T)
plot(pts)
axis(1)
axis(2)
apply(as.data.frame(pts),1,function(x) plot(gBuffer(SpatialPoints(coords=matrix(c(x[1],x[2]),nrow=1)),width=2),add=T))
plot(pts2,add=T,col='blue',pch=20,cex=2)
There is also an R package called spThin that performs spatial thinning on point data. It was developed to reduce the effects of sampling bias for species distribution models, and it runs multiple iterations for optimization. The function is quite easy to use; the vignette can be found here. There is also a paper in Ecography with details about the technique.
Following Josh O'Brien's advice, I looked at spatstat's rMaternI function, and came up with the following. It seems to work pretty well.
The distance is in map units. It would be nice to incorporate one of R's distance functions that always returns distances in meters rather than in input units, but I couldn't figure that out (see the sketch after the code below for one possibility).
require(spatstat)
require(maptools)
occ <- readShapeSpatial('occurrence_example.shp')
filterByProximity <- function(occ, dist) {
  pts <- as.ppp.SpatialPoints(occ)
  d <- nndist(pts)
  z <- which(d > dist)
  return(occ[z,])
}
occ2 <- filterByProximity(occ,dist=0.2)
plot(occ)
plot(occ2,add=T,col='blue',pch=20)
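On the map-units point above: if the shapefile is in unprojected longitude/latitude, one way to work in kilometers instead (a sketch of mine, assuming the sp package and that occ really is lon/lat; the full n-by-n matrix is fine for modest numbers of points):
library(sp)
d_km <- spDists(occ, longlat = TRUE)        # pairwise great-circle distances in km
diag(d_km) <- NA
nearest_km <- apply(d_km, 1, min, na.rm = TRUE)
occ_thinned <- occ[nearest_km > 20, ]       # e.g. keep points whose nearest neighbour is > 20 km away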
Rather than removing data points, you might consider spatial declustering. This involves giving points in clusters a lower weight than outlying points. The two simplest ways to do this involve a polygonal segmentation, like a Voronoi diagram, or some arbitrary grid. Both methods will weight points in each region according to the area of the region.
For example, if we take the points in your test (1,1),(2,2),(4.5,4.5),(5,5),(1,5) and apply a regular 2-by-2 mesh, where each cell is three units on a side, then the five points fall into three cells. The points ((1,1),(2,2)) falling into the cell [0,3]X[0,3] would each have weights 1/( no. of points in current cell TIMES tot. no. of occupied cells ) = 1 / ( 2 * 3 ). The same thing goes for the points ((4.5,4.5),(5,5)) in the cell (3,6]X(3,6]. The "outlier", (1,5) would have a weight 1 / ( 1 * 3 ). The nice thing about this technique is that it is a quick way to generate a density based weighting scheme.
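A small sketch of that grid-based weighting (my own illustration of the scheme just described, using the same five points and 3-unit cells):
pts <- data.frame(x = c(1, 2, 4.5, 5, 1), y = c(1, 2, 4.5, 5, 5))
cell <- paste(ceiling(pts$x / 3), ceiling(pts$y / 3))   # which 3-unit cell each point falls in
n_in_cell <- as.numeric(table(cell)[cell])              # points sharing that cell
pts$w <- 1 / (n_in_cell * length(unique(cell)))         # weights: 1/6, 1/6, 1/6, 1/6, 1/3 (sum to 1)
pts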
A polygonal segmentation involves drawing a polygon around each point and using the area of that polygon to calculate the weight. Generally, the polygons completely cover the entire region, and each weight is taken proportional to the area of its polygon, so clustered points, which get small polygons, receive low weights. A Voronoi diagram is usually used for this, but polygonal segmentations may be calculated using other techniques, or may be specified by hand.
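For the Voronoi variant, a sketch (my own illustration, assuming the deldir package) that turns Dirichlet tile areas into declustering weights for the same five points:
library(deldir)
pts <- data.frame(x = c(1, 2, 4.5, 5, 1), y = c(1, 2, 4.5, 5, 5))
vor <- deldir(pts$x, pts$y, rw = c(0, 6, 0, 6))   # tessellation clipped to the [0,6] x [0,6] window
pts$w <- vor$summary$dir.area / sum(vor$summary$dir.area)   # equivalently vor$summary$dir.wts
pts   # the x/y columns of vor$summary let you confirm the row order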
