Combine points within a given radius to a centroid in R

I feel like this cannot be too hard. I know hclust() and cutree(), but how do I obtain the coordinates of centroids such that no point is farther from its centroid than a given radius? I know that a point within range of one centroid may already belong to another centroid that is not within range; I am fine with that.
set.seed(1)
data <- matrix(runif(100),ncol=2)
plot(data)
dclust <- hclust(dist(data),method="centroid")
cutree(dclust,h=0.1)
cutree(..., h=0.1) already fails here because the merge heights of dclust are not monotonically increasing under the "centroid" method.

Using your data and running kmeans with 25 groups produces the following results. Is this what you are getting at?
Example <- kmeans(data, 25)
plot(data, type="n")
text(Example$centers, unlist(dimnames(Example$centers)), col="red")
text(data, as.character(Example$cluster), cex=.75)
cdist <- sqrt((data[,1] - Example$centers[Example$cluster, 1])^2 +
(data[, 2] - Example$centers[Example$cluster, 2])^2)
names(cdist) <- 1:50
cdist
The last three lines compute and display the distance of each point to the centroid to which it has been assigned.
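If the hard radius constraint matters more than a fixed number of groups, here is a minimal sketch (my suggestion, not part of the answer above) using complete linkage: the cut height bounds the largest pairwise distance within each cluster, and a point can never be farther from its cluster centroid than from its most distant cluster member, so cutting at h = radius keeps every point within the radius of its centroid. Unlike the "centroid" method, the merge heights are monotone here, so cutree() works.
radius <- 0.1
cl <- cutree(hclust(dist(data), method = "complete"), h = radius)
centroids <- aggregate(as.data.frame(data), by = list(cluster = cl), FUN = mean)
## distance of each point to its assigned centroid -- never exceeds radius
cdist2 <- sqrt(rowSums((data - as.matrix(centroids[cl, -1]))^2))
max(cdist2) <= radius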

Related

Shrink convex hull

I have a bunch of points in 2D space and have calculated a convex hull for them. I would now like to "tighten" the hull so that it no longer necessarily encompasses all points. In the typical nails-in-board-with-rubber-band analogy, what I'd like to achieve is to be able to tune the elasticity of the rubber band and allow nails to bend under pressure above some limit. That's just an analogy; there is no real physics here.

This would be loosely related to the reduction in hull area if a given point were removed, but not quite, because there could be two points that are very close to each other. It is not necessarily related to outlier detection either, because you could imagine a pattern where a large fraction of the nails would bend if they lie on a narrow line (imagine a hammer shape, for example). All of this has to be reasonably fast for thousands of points. Any hints on where I should look in terms of algorithms? An implementation in R would be perfect, but is not needed.
EDIT AFTER COMMENT: The three points I've labelled are those with the largest potential for reducing the hull area if they are excluded; in the plot there is no other set of three points whose removal would give a larger area reduction. A naive implementation of what I'm looking for would perhaps be to randomly sample some fraction of the points, calculate the hull area, remove each point on the hull iteratively, recalculate the area, repeat many times and remove points that tend to lead to a high area reduction. Maybe this could be implemented in some random forest variant? It's not quite right though, because I would like the points to be removed one by one so that you get the result shown. If you looked at all points in one go, it would possibly be best to trim from the edges of the "hammer head".
Suppose I have a set of points like this:
set.seed(69)
x <- runif(20)
y <- runif(20)
plot(x, y)
Then it is easy to find the subset of points that sit on the convex hull by doing:
ss <- chull(x, y)
This means we can plot the convex hull by doing:
lines(x[c(ss, ss[1])], y[c(ss, ss[1])], col = "red")
Now we can randomly remove one of the points that sits on the convex hull (i.e. "bend a nail") by doing:
bend <- sample(ss, 1)
x <- x[-bend]
y <- y[-bend]
And we can then repeat the process of finding the convex hull of this new set of points:
ss <- chull(x, y)
lines(x[c(ss, ss[1])], y[c(ss, ss[1])], col = "blue", lty = 2)
To get the point which will, on removal, cause the greatest reduction in area, one option would be the following function:
library(sp)
shrink <- function(coords)
{
    ss <- chull(coords[, 1], coords[, 2])
    outlier <- ss[which.min(sapply(seq_along(ss),
                   function(i) Polygon(coords[ss[-i], ], hole = FALSE)@area))]
    coords[-outlier, ]
}
So you could do something like:
coords <- cbind(x, y)
new_coords <- shrink(coords)
new_chull <- new_coords[chull(new_coords[, 1], new_coords[, 2]),]
new_chull <- rbind(new_chull, new_chull[1,])
plot(x, y)
lines(new_chull[,1], new_chull[, 2], col = "red")
Of course, you could do this in a loop so that new_coords is fed back into shrink multiple times.
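A minimal sketch of that loop (the three iterations are an arbitrary choice):
coords <- cbind(x, y)
for (i in 1:3) coords <- shrink(coords)
hull <- coords[chull(coords[, 1], coords[, 2]), ]
plot(x, y)
lines(rbind(hull, hull[1, ]), col = "blue", lty = 2)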
Calculate a robust center and covariance using cov.mcd in MASS and the Mahalanobis distance of each point from it (using mahalanobis in stats). We then show a quantile plot of the Mahalanobis distances using PlotMD from modi and also show the associated outliers in red in the second plot. (There are other functions in modi that may be of interest as well.)
library(MASS)
library(modi)
set.seed(69)
x <- runif(20)
y <- runif(20)
m <- cbind(x, y)
mcd <- cov.mcd(m)
md <- mahalanobis(m, mcd$center, mcd$cov)
stats <- PlotMD(md, 2, alpha = 0.90)
giving:
(continued after screenshot)
and we show the convex hull using lines and the outliers in red:
plot(m)
ix <- chull(m)
lines(m[c(ix, ix[1]), ])
wx <- which(md > stats$halpha)
points(m[wx, ], col = "red", pch = 20)
Thank you both! I've tried various methods for outlier detection, but it's not quite what I was looking for. They have worked badly due to weird shapes of my clusters. I know I talked about convex hull area, but I think filtering on segment lengths yields better results and is closer to what I really wanted. Then it would look something like this:
shrink <- function(xy, max_length = 30){
    to_keep <- 1:(dim(xy)[1])
    centroid <- c(mean(xy[,1]), mean(xy[,2]))
    while (TRUE){
        ss <- chull(xy[,1], xy[,2])
        ss <- c(ss, ss[1])
        lengths <- sapply(1:(length(ss)-1),
                          function(i) sum((xy[ss[i+1],] - xy[ss[i],])^2))
        # This gets the point with the longest convex hull segment. chull returns
        # points in clockwise order, so the point to remove is either this one or
        # the one after it. Remove the one furthest from the centroid.
        max_point <- which.max(lengths)
        if (lengths[max_point] < max_length) return(to_keep)
        if (sum((xy[ss[max_point],] - centroid)^2) >
            sum((xy[ss[max_point + 1],] - centroid)^2)){
            xy <- xy[-ss[max_point],]
            to_keep <- to_keep[-ss[max_point]]
        } else {
            xy <- xy[-ss[max_point + 1],]
            to_keep <- to_keep[-ss[max_point + 1]]
        }
    }
}
It's not optimal because it factors in the distance to the centroid, which I would have liked to avoid, and there is a max_length parameter that should be calculated from the data instead of being hard-coded.
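The thread does not settle how to pick max_length, but one way to derive it from the data (my assumption; the multiplier is arbitrary) would be to scale it off the typical squared hull-segment length:
## xy is assumed to be the two-column matrix of points passed to shrink()
hull <- chull(xy[, 1], xy[, 2])
hull <- c(hull, hull[1])
seg2 <- sapply(1:(length(hull) - 1), function(i) sum((xy[hull[i + 1], ] - xy[hull[i], ])^2))
keep <- shrink(xy, max_length = 5 * median(seg2))   # the factor 5 is purely illustrative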
No filter:
It looks like this because there are 500 000 cells in here, and there are many that end up "wrong" when projecting from ~20 000 dimensions to 2.
Filter:
Note that it filters out points at tips of some clusters. This is less-than-optimal but ok. The overlap between some clusters is true and should be there.

how to get point set (x,y) in a desired area in r

The figure shows a plot of the (x, y) pairs from an Excel file, 8760 pairs in total. I want to remove the noisy data pairs in the red-circled area and output a new Excel file with the remaining pairs. How can I do this in R?
Using @G5W's example:
Make up data:
set.seed(2017)
x = runif(8760, 0,16)
y = c(abs(rnorm(8000, 0, 1)), runif(760,0,8))
XY = data.frame(x,y)
Fit a quantile regression to the 90th percentile:
library(quantreg)
library(splines)
qq <- rq(y~ns(x,20),tau=0.9,data=XY)
Compute and draw the predicted curve:
xvec <- seq(0,16,length.out=101)
pp <- predict(qq,newdata=data.frame(x=xvec))
plot(y~x,data=XY)
lines(xvec,pp,col=2,lwd=2)
Keep only points below the predicted line:
XY2 <- subset(XY,y<predict(qq,newdata=data.frame(x)))
plot(y~x,data=XY2)
lines(xvec,pp,col=2,lwd=2)
You can make the line less wiggly by lowering the number of knots, e.g. y~ns(x,10)
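Spelling that refit out (qq10 is just an illustrative name):
qq10 <- rq(y ~ ns(x, 10), tau = 0.9, data = XY)
lines(xvec, predict(qq10, newdata = data.frame(x = xvec)), col = 4, lwd = 2)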
Both R and EXCEL read and write .csv files, so you can use those to transfer the data back and forth.
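For example, a minimal sketch of that round trip (the file names are assumptions):
XY <- read.csv("points.csv")                              # data exported from Excel as CSV
## ... fit the quantile regression and build XY2 as shown above ...
write.csv(XY2, "points_filtered.csv", row.names = FALSE)  # open this file in Excel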
You do not provide any data so I made some junk data to produce a similar problem.
DATA
set.seed(2017)
x = runif(8760, 0,16)
y = c(abs(rnorm(8000, 0, 1)), runif(760,0,8))
XY = data.frame(x,y)
One way to identify noise points is by looking at the distance to the nearest neighbors. In dense areas, nearest neighbors will be closer. In non-dense areas, they will be further apart. The package dbscan provides a nice function to get the distance to the k nearest neighbors. For this problem, I used k=6, but you may need to tune for your data. Looking at the distribution of distances to the 6th nearest neighbor we see that most points have 6 neighbors within a distance of 0.2
library(dbscan)
# in recent versions of dbscan, all = TRUE is needed to get one column per neighbor
XY6 = kNNdist(XY, 6, all = TRUE)
plot(density(XY6[,6]))
So I will assume that points whose 6th nearest neighbor is farther away than 0.2 are noise points. Just changing the color to see which points are affected, we get:
TYPE = rep(1,8760)
TYPE[XY6[,6] > 0.2] = 2
plot(XY, col=TYPE)
Of course, if you wish to restrict to the non-noise points, you can use
NonNoise = XY[XY6[,6] <= 0.2,]

Coordinates of greatest kernel density for trivariate KDE

I have calculated the kernel density of a 3-column matrix in R using the following code:
ss<-read.table("data.csv",header=TRUE,sep=",")
x<-ss[,1]
y<-ss[,2]
z<-ss[,3]
ssdata<-c(x,y,z)
ssmat<-matrix(ssdata,,3)
rp<-kde(ssmat)
plot(rp)
What I need now are the (x, y, z) coordinates of the point of maximum kernel density. Based on an answer provided on the R-help list, I understand that the kde() function estimates the joint density of the three variables as a fourth dimension, which is represented in the 3D plot by shading to indicate areas of greater point density. So in effect I am trying to locate the maximum value of this "fourth" dimension. I suspect that this is a relatively simple problem, but I haven't been able to find the answer. Any ideas?
You can extract the max value from the info returned from kde. To see all the stuff returned, use str(rp).
library(rgl)   # points3d() below comes from rgl
## Get the array indices of the maximum estimated density
inds <- which(abs(rp$estimate - max(rp$estimate)) < 1e-10, arr.ind = TRUE)
xyz <- mapply(function(a, b) a[b], rp$eval.points, inds)
## Add it to the plot
plot(rp)
points3d(x = xyz[1], y = xyz[2], z = xyz[3], size = 20, col = "blue")

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into an other, larger, set of 2D coordinates. Imagine for example having a Polaroid photo of a small piece of the starry sky with you out at night, and you want to hold it up in a position so they match the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# take a subset of the reference coordinates to serve as the points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
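Purely as an illustration of what that delta means (this reads the true shift off the toy data, which a real solution of course cannot do), the dotted line corresponds to a single (dx, dy) shared by all points:
dx <- my_coords_final[1, 1] - my_coords_start[1, 1]
dy <- my_coords_final[1, 2] - my_coords_start[1, 2]
points(sweep(my_coords_start, 2, c(dx, dy), "+"), col = "darkgreen")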
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find the position where they best match. That position is, in this case, represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see the edit history), which used optim() to minimize the sum of distances between points, will only work in the limited circumstance where the point distribution used as reference data sits in the middle of the point field. The solution that satisfies the question, and still seems workable for a few thousand points, is a brute-force delta-and-comparison algorithm: for each point of the reference data it calculates the shift that would align that point with the first point of the search data, and then counts how many of the remaining points end up within a minimum distance threshold of each other (the threshold is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0

for(i in 1:nrow(ref_coords)) {
    x2 <- my_coords_start[,1]
    y2 <- my_coords_start[,2]
    x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
    y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])

    ## Calculate all pairwise distances between reference and field data:
    dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")

    ## Only take distances for the sampled data:
    dists <- as.matrix(dists)[-1*1:length(x1),]

    ## Calculate the number of distances within the minimum
    ## distance threshold minus the diagonal portion:
    win_thresh <- sum(rowSums(dists < min_dist) > 1)

    ## If we have more "matches" than our best then calculate a new
    ## dx and dy:
    if (win_thresh > win_thresh_old) {
        win_thresh_old <- win_thresh
        dx <- (x2[1] - ref_coords[i,1])
        dy <- (y2[1] - ref_coords[i,2])
    }
}

## Plot the estimated correction (your delta x and delta y) calculated
## from the brute-force search of shifts:
points(
    x = ref_coords[,1] + dx,
    y = ref_coords[,2] + dy,
    cex = 1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.

spatial distribution of points, R

What would be an easy way to generate 3 different spatial distributions of points (N = 20 points each) using R? For example: 1) random, 2) uniform, and 3) clustered, on the same space (a 50 x 50 grid).
1) Here's one way to get a very even spacing of 5 points in a 25 by 25 grid numbered from 1 in each direction: put points at (3,18), (8,3), (13,13), (18,23), (23,8); you should be able to generalize from there. (A code sketch of all three options follows after point 3.)
2) as you suggest, you could use runif ... but I'd have assumed from your question you actually wanted points on the lattice (i.e. integers), in which case you might use sample.
Are you sure you want continuous rather than discrete random variables?
3) This one is "underdetermined" - depending on how you want to define things there's a bunch of ways you might do it. e.g. if it's on a grid, you could sample points in such a way that points close to (but not exactly on) already sampled points had a much higher probability than ones further away; a similar setup works for continuous variables. Or you could generate more points than you need and eliminate the loneliest ones. Or you could start with random uniform points and them make them gravitate toward their neighbors. Or you could generate a few cluster-centers (4-10, say), and then scatter points about those centers. Or you could do any of a hundred other things.
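Not part of the answers above, just a minimal sketch of the three suggestions (the 25 x 25 grid for point 1, a 50 x 50 space for points 2 and 3, and an arbitrary choice of 4 cluster centers with spread 1.5 for point 3):
## 1) even spacing: the five points listed in point 1)
even_pts <- cbind(x = c(3, 8, 13, 18, 23), y = c(18, 3, 13, 23, 8))
## 2) random uniform points, continuous (runif) or on the integer lattice (sample)
set.seed(1)
unif_cont <- cbind(runif(20, 0, 50), runif(20, 0, 50))
unif_lat  <- cbind(sample(1:50, 20, replace = TRUE), sample(1:50, 20, replace = TRUE))
## 3) clustered: a few cluster centers, then scatter points around them
centers   <- cbind(runif(4, 0, 50), runif(4, 0, 50))
clustered <- centers[sample(1:4, 20, replace = TRUE), ] + matrix(rnorm(40, 0, 1.5), ncol = 2)
par(mfrow = c(1, 3))
plot(even_pts, xlim = c(1, 25), ylim = c(1, 25), pch = 19, main = "even")
plot(unif_cont, xlim = c(0, 50), ylim = c(0, 50), pch = 19, main = "uniform/random")
points(unif_lat, pch = 3)
plot(clustered, xlim = c(0, 50), ylim = c(0, 50), pch = 19, main = "clustered")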
A bit late, but the answers above do not really address the problem. Here is what you are looking for:
library(sp)
# make a grid of size 50*50
x1<-seq(1:50)-0.5
x2<-x1
grid<-expand.grid(x1,x2)
names(grid)<-c("x1","x2")
# make a grid a spatial object
coordinates(grid) <- ~x1+x2
gridded(grid) <- TRUE
First: random sampling
# random sampling
random.pt <- spsample(x = grid, n= 20, type = 'random')
Second: regular sampling
# regular sampling
regular.pt <- spsample(x = grid, n= 20, type = 'regular')
Third: clustered at a distance of 2 from a random location (can go outside the area)
# random sampling of one location
ori <- data.frame(spsample(x = grid, n= 1, type = 'random'))
# select randomly 20 distances between 0 and 2
n.point <- 20
h <- runif(n.point, 0, 2)
# empty dataframe
dxy <- data.frame(matrix(nrow=n.point, ncol=2))
# draw a random angle for each point and turn the distances into x/y offsets from the sampled location
angle <- runif(n = n.point,min=0,max=2*pi)
dxy[,1]= h*sin(angle)
dxy[,2]= h*cos(angle)
cluster <- data.frame(x=rep(NA, 20), y=rep(NA, 20))
cluster$x <- ori$coords.x1 + dxy$X1
cluster$y <- ori$coords.x2 + dxy$X2
# make a spatial object and plot
coordinates(cluster)<- ~ x+y
plot(grid)
plot(cluster, add=T, col='green')
plot(random.pt, add=T, col= 'red')
plot(regular.pt, add=T, col= 'blue')

Resources