Maths - clustering of points

Hi all, this is a very simple question, but my mind is a bit empty and I can't seem to find any satisfactory results on the internet.
Given a collection of 2D points (x,y), how can I determine how tightly they are grouped together?
Thanks
I guess an example would have helped. I am trying to measure the "wobble" when aiming at a target, so I have every point the shooter aimed at, and I would like to see whether they were steady or moved a lot.

It depends on your definition of "tight grouping". One possibility is the sample variance, or the corresponding standard deviation. Crudely speaking, this gives you an "average" distance away from the centre point (which can be defined either as a known point, or as simply the average of your dataset).
For a group of 2D points, this can be defined as:
stddev = sqrt(var) = sqrt(1/N * SUM { (x - x0)^2 + (y - y0)^2 })
where (x0,y0) is the sample mean (i.e. the average of all your points).
This metric will be less sensitive to outliers than e.g. the bounding box metric.
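For what it's worth, here is a minimal R sketch of that metric (the coordinates below are just made-up aim points):

# made-up aim points
x <- c(1.2, 0.8, 1.1, 0.9, 1.0)
y <- c(2.0, 2.1, 1.9, 2.2, 1.8)

x0 <- mean(x)   # sample mean = centre of the group
y0 <- mean(y)

# root mean squared distance from the centre
spread <- sqrt(mean((x - x0)^2 + (y - y0)^2))
spread   # smaller value = tighter grouping (steadier aim)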

One simple way to do this is to calculate the bounding box that contains all of the points, compute its area, and then divide the number of points by that area to give you a points-per-area value. This could be enough depending on what you need it for, but it can be rather inaccurate.
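A quick R sketch of that idea (the points here are just random values for illustration):

x <- runif(20)
y <- runif(20)
box_area <- diff(range(x)) * diff(range(y))   # area of the axis-aligned bounding box
length(x) / box_area                          # points per unit area; higher = tighter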

Related

What does it mean to spread points out as much as possible, given that each point has its own range?

I need help defining/clarifying a problem I am trying to solve.
I have n points in one dimension. Each point has its own range that it can move between. They cannot move outside of that range. I need to "spread" the points out as much as possible. I want to maximize the distance between all adjacent points. What is a more formal way to say this?
Using a smaller example with three points:
I can move the left and right points to the sides as much as possible. Then I can find the midpoint of this distance and move the middle point to that midpoint as much as possible.
I have maximized the average distance between all points, but I have also minimized the variance of the distances.
These seem like the two core constraints I need to "spread out" the points as much as possible. I eventually need to develop an algorithm that can do this with an arbitrary number of points.
Am I correct that I need to maximize the average distance between all points while also minimizing the variance of these nearest-neighbor distances?
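As a concrete illustration, here is a rough R sketch of the three-point procedure described above (the ranges are invented):

# each point i may sit anywhere in [lo[i], hi[i]]
lo <- c(0, 2, 6)
hi <- c(3, 8, 10)

# push the outer points to the extremes of their ranges
left  <- lo[1]
right <- hi[3]

# move the middle point as close to the midpoint of [left, right] as its range allows
mid <- min(max((left + right) / 2, lo[2]), hi[2])

c(left, mid, right)   # 0, 5, 10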

Find minimum set of rays intersecting all voxels

Okay, first, I wasn't sure if this was better suited to MathSO, so apologies if it needs migrating.
I have a 3D grid of points (representing the centers of voxels) with pitch varying in each dimension, but regular. For example, the resolution may be 100 by 50 by 40 for a cube-shaped object, giving me nVox = 200,000.
For each voxel, I would like to cast (nVox - 1) rays, ending at its center and originating from each of the other voxels.
Now there is obviously a lot of overlap here, but I am having trouble working out how to calculate the minimum set of rays required. This sounds like a problem that has an elegant solution; I am, however, struggling to find it.
As a start, it is obvious that you only need to compute
[nVox * (nVox - 1)] / 2
of the rays, as the other half will simply be in the opposite directions. It is also easy in the 2D case to combine all of those parallel to one of the grid axes (and the two diagonals).
So how do I find the minimum set of rays I need, to pass from all voxel centers, to all others?
If someone could point me in the right direction that'd be great. Any and all help will be much appreciated.
Your problem really isn't about three dimensions in any specific way. All the conceptual complexity is present in the two dimensional case.
Instead of connecting points individually, think about the set of lines that pass through at least two points on your grid. Thus instead of thinking about points initially, think about directions. For 2-D these directions are slopes of lines. These slopes have to be rational numbers, since they intersect points on an integer lattice. Since you have a finite lattice, the numerator and the denominator of the slope can be bounded by the size of the figure. So your underlying problem is enumerating possible slopes for rational numbers of bounded "height" (math jargon).
There's an algorithm for that. It's the one used to generate the Farey sequence of reduced fractions. If your figure is N pixels wide, there will (in general) be a slope with denominator N somewhere in the sequence, but there can't be a slope in reduced form with denominator > N; it wouldn't fit.
It's easier to deal with slopes between 0 and 1 directly. You get the other directions by two operations: negating the slope and by interchanging axes. For three dimensions, you need two slopes to define a direction.
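As a sketch of that enumeration step, here is a small R function generating the Farey sequence of order N (i.e. all the reduced slopes between 0 and 1 for a figure N cells wide), using the standard next-term recurrence:

farey <- function(n) {
  # all reduced fractions p/q with 0 <= p/q <= 1 and q <= n, in increasing order
  p <- 0; q <- 1   # current term
  r <- 1; s <- n   # next term
  nums <- p; dens <- q
  while (r <= n) {
    k <- (n + q) %/% s
    r_new <- k * r - p
    s_new <- k * s - q
    p <- r; q <- s
    r <- r_new; s <- s_new
    nums <- c(nums, p); dens <- c(dens, q)
  }
  data.frame(num = nums, den = dens)
}

f <- farey(5)
f$num / f$den   # 0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 1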
Given an arbitrary direction (not necessarily a rational one as above), there's a perpendicular linear space of dimension k-1; for 3-D that's a plane. Projecting a 3-D parallelepiped onto this plane yields a hexagon in general; two vertices project into the interior and six project to the vertices of the hexagon.
For a given discrete direction, there's a minimal bounding box on the integer lattice such that two opposite vertices lie along that direction. As long as that bounding box fits within your original grid, each of the interior points of the projection corresponds to a line that intersects your grid in at least two points.
In summary, enumerate directions, then for each direction enumerate where that direction intersects your grid in at least two points.

Finding a density peak / cluster centrum in 2D grid / point process

I have a dataset with minute-by-minute GPS coordinates recorded by a person's cellphone, i.e. the dataset has 1440 rows with LON/LAT values. Based on the data I would like a point estimate (lon/lat value) of where the participant's home is. Let's assume that home is the single location where they spend most of their time in a given 24h interval. Furthermore, the GPS sensor has quite high accuracy most of the time; however, sometimes it is completely off, resulting in gigantic outliers.
I think the best way to go about this is to treat it as a point process and use 2D density estimation to find the peak. Is there a native way to do this in R? I looked into kde2d (MASS) but this didn't really seem to do the trick. kde2d creates a 25x25 grid of the data range with density values. However, in my data the person can easily travel 100 miles or more per day, so these blocks are generally too large an estimate. I could narrow them down and use a much larger grid, but I am sure there must be a better way to get a point estimate.
There are "time spent" functions in the trip package (I'm the author). You can create objects from the track data that understand the underlying track process over time, and simply process the points assuming straight line segments between fixes. If "home" is where the largest value pixel is, i.e. when you break up all the segments based on the time duration and sum them into cells, then it's easy to find it. A "time spent" grid from the tripGrid function is a SpatialGridDataFrame with the standard sp package classes, and a trip object can be composed of one or many tracks.
Using rgdal you can easily transform coordinates to an appropriate map projection if lon/lat is not appropriate for your extent, but it makes no difference to the grid/time-spent calculation of line segments.
There is a simple speedfilter to remove fixes that imply movement that is too fast, but it is very simplistic and can introduce new problems; in general, updating or filtering tracks for unlikely movement can be very complicated. (In my experience a basic time-spent gridding gets you as good an estimate as many sophisticated models that just open up new complications.) The filter works with Cartesian or long/lat coordinates, using tools in sp to calculate distances (long/lat is reliable, whereas a poor map projection choice can introduce problems; over short distances like humans on land it's probably no big deal).
(The function tripGrid calculates the exact components of the straight line segments using pixellate.psp, but that detail is hidden in the implementation).
In terms of data preparation, trip is strict about a sensible sequence of times and will prevent you from creating an object if the data have duplicates, are out of order, etc. There is an example of reading data from a text file in ?trip, and a very simple example with (really) dummy data is:
library(trip)

# (really) dummy data: 10 fixes, one second apart, all with the same id
d <- data.frame(x = 1:10, y = rnorm(10), tms = Sys.time() + 1:10, id = gl(1, 5))
coordinates(d) <- ~x+y                  # promote to a Spatial object (sp)
tr <- trip(d, c("tms", "id"))           # declare the time and id columns
g <- tripGrid(tr)                       # time-spent grid (SpatialGridDataFrame)
pt <- coordinates(g)[which.max(g$z), ]  # centre of the cell with the most time spent
image(g, col = c("transparent", heat.colors(16)))
lines(tr, col = "black")
points(pt[1], pt[2], pch = "+", cex = 2)
That dummy track has no overlapping regions, but it shows that finding the max point in "time spent" is simple enough.
How about using the location that minimises the sum of squared distances to all the events? This might be close to the supremum of any kernel smoothing, if my brain is working right.
If your data comprises two clusters (home and work) then I think the location will be in the biggest cluster rather than between them. It's not the same as the simple mean of the x and y coordinates.
For an uncertainty on that, jitter your data by whatever your positional uncertainty is (it would be great if you had that value from the GPS; otherwise guess, 50 metres?) and recompute. Do that 100 times, do a kernel smoothing of those locations, and find the 95% contour.
Not rigorous, and I need to experiment with this minimum distance/kernel supremum thing...
In response to spacedman - I am pretty sure least squares won't work. Least squares is best known for bowing to the demands of outliers, without much weighting to things that are "nearby". This is the opposite of what is desired.
The bisquare estimator would probably work better, in my opinion - but I have never used it. I think it also requires some tuning.
It's more or less like a least squares estimator up to a certain distance from 0, and then the weighting is constant beyond that. So once a point becomes an outlier, its penalty is constant. We don't want outliers to weigh more and more as we move away from them; we would rather weight them constantly and let the optimization focus on better fitting the things in the vicinity of the cluster.
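To make that concrete, here is a rough R sketch of a robust 2D location estimate using iteratively reweighted least squares with Tukey bisquare weights. The function name and the tuning constant k (in degrees, roughly 1 km) are my own guesses and would need adjusting to the data:

robust_centre <- function(lon, lat, k = 0.01, iters = 50) {
  cx <- median(lon); cy <- median(lat)          # robust starting point
  for (i in seq_len(iters)) {
    d <- sqrt((lon - cx)^2 + (lat - cy)^2)      # distance of each fix to the current centre
    w <- ifelse(d < k, (1 - (d / k)^2)^2, 0)    # bisquare weights: distant outliers get weight 0
    if (sum(w) == 0) break                      # guard: everything flagged as an outlier
    cx <- sum(w * lon) / sum(w)                 # weighted mean update
    cy <- sum(w * lat) / sum(w)
  }
  c(lon = cx, lat = cy)
}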

Sphere that surely encompass given list of points [points are with x, y and z co-ordinate]

I am trying to find a sphere that surely encompasses a given list of points.
The points have x, y and z co-ordinates (the points are in 3D).
Actually I am trying to find three new points based on the given list of points by some calculations, such as finding MinX, MaxX, MinY, MaxY, MinZ and MaxZ, doing some operations, and deriving three new points from them.
I will then draw the sphere from these three points.
I will also place all three points on the diameter of the sphere, so I have a unique sphere.
Is there any standard way for finding encompassing sphere of given list of points?
Yes, the standard algorithm is Welzl's algorithm (assuming you want the minimal sphere around your points). In particular the improved version by Gärtner is very useful, robust and numerically stable! It handles all the degenerate cases well too.
At its core, the algorithm permutes the points (randomly) to find the 1-4 points that lie on the boundary of the sphere. It's basically a clever trial-and-error algorithm. From these points, you can find the center by finding a point that has the same distance to all of them. Gärtner's version uses an improved numerical device to find the center. It also employs an extra pivoting step that presumably makes the algorithm work better for a large number of input points.
If all you want is a sphere around three points, I suggest you still use Gärtner's "device" to compute the circumsphere of the triangle. Otherwise, the method will probably degenerate easily (i.e. when the triangle is very flat).
Do you need 3 points, or any number of points?
If you only need the answer for 3 points, each pair of points defines a line segment. Take the longest line segment. Take a sphere centered at the middle of that line segment, whose radius is half the length of the line segment. There are two cases.
The third point is inside of that initial sphere. If so, then you have the smallest sphere.
The third point is outside of that initial sphere. Then the solution at Find Circum Center of Three point of Triangle [Not using Compass] will give you the center of the smallest sphere containing those 3 points.
If you need an arbitrary number of points, I'd do some sort of iterative approximation algorithm. Since you don't seem like you need that, I won't work out the details.
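For the three-point case just described, a rough R sketch (the example coordinates at the end are arbitrary):

cross <- function(u, v) c(u[2]*v[3] - u[3]*v[2],
                          u[3]*v[1] - u[1]*v[3],
                          u[1]*v[2] - u[2]*v[1])

sphere_from_3_points <- function(p1, p2, p3) {
  pts <- list(p1, p2, p3)
  # first try the sphere whose diameter is the longest of the three segments
  pairs <- list(c(1, 2), c(1, 3), c(2, 3))
  lens <- sapply(pairs, function(ij) sqrt(sum((pts[[ij[1]]] - pts[[ij[2]]])^2)))
  ij <- pairs[[which.max(lens)]]
  centre <- (pts[[ij[1]]] + pts[[ij[2]]]) / 2
  radius <- max(lens) / 2
  third <- pts[[setdiff(1:3, ij)]]
  if (sqrt(sum((third - centre)^2)) <= radius)
    return(list(centre = centre, radius = radius))
  # otherwise the smallest enclosing sphere passes through all three points:
  # use the circumcentre of the triangle
  a <- p2 - p1; b <- p3 - p1
  ab <- cross(a, b)
  centre <- p1 + cross(sum(a*a) * b - sum(b*b) * a, ab) / (2 * sum(ab*ab))
  list(centre = centre, radius = sqrt(sum((p1 - centre)^2)))
}

sphere_from_3_points(c(0, 0, 0), c(2, 0, 0), c(1, 1, 0))   # centre (1,0,0), radius 1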

How do I calculate a normal vector based on multiple triangles sharing a vertex?

If I have a mesh of triangles, how does one go about calculating the normals at each given vertex?
I understand how to find the normal of a single triangle. If I have triangles sharing vertices, I can partially find the answer by finding each triangle's respective normal, normalizing it, adding it to the total, and then normalizing the end result. However, this obviously does not take into account proper weighting of each normal (many tiny triangles can throw off the answer when linked with a large triangle, for example).
I think a good method is to use a weighted average, but with angles instead of areas as weights. In my opinion this is a better answer because the normal you are computing is a "local" feature, so you don't really care how big the contributing triangle is... you need a sort of "local" measure of the contribution, and the angle between the two sides of the triangle at the specified vertex is such a local measure.
With this approach, a lot of small (thin) triangles don't give you an unbalanced answer.
Using angles is the same as using an area-weighted average if you localize the computation by taking the intersection of the triangles with a small sphere centered at the vertex.
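A rough R sketch of that angle-weighted accumulation for a single vertex (the mesh data structure is left out; the inputs are just the vertex itself and, for each incident triangle, a pair containing its two other vertices, assumed to be in consistent winding order):

angle_weighted_normal <- function(v, incident) {
  n <- c(0, 0, 0)
  for (tri in incident) {
    e1 <- tri[[1]] - v
    e2 <- tri[[2]] - v
    face_n <- c(e1[2]*e2[3] - e1[3]*e2[2],     # cross product = face normal
                e1[3]*e2[1] - e1[1]*e2[3],
                e1[1]*e2[2] - e1[2]*e2[1])
    face_n <- face_n / sqrt(sum(face_n^2))     # unit face normal
    # weight: angle between the two edges meeting at v
    ang <- acos(sum(e1 * e2) / (sqrt(sum(e1^2)) * sqrt(sum(e2^2))))
    n <- n + ang * face_n
  }
  n / sqrt(sum(n^2))                           # normalise the accumulated normal
}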
The weighted average appears to be the best approach.
But be aware that, depending on your application, sharp corners could still give you problems. In that case, you can compute multiple vertex normals by averaging surface normals whose cross product is less than some threshold (i.e., closer to being parallel).
Search for "Offset triangular mesh using the multiple normal vectors of a vertex" by S.J. Kim et al. for more details about this method.
This blog post outlines three different methods and gives a visual example of why the standard and simple method (area weighted average of the normals of all the faces joining at the vertex) might sometimes give poor results.
You can give more weight to big triangles by multiplying the normal by the area of the triangle.
Check out this paper: Discrete Differential-Geometry Operators for Triangulated 2-Manifolds.
In particular, the "Discrete Mean Curvature Normal Operator" (Section 3.5, Equation 7) gives a robust normal that is independent of tessellation, unlike the methods in the blog post cited by another answer here.
Obviously you need to use a weighted average to get a correct normal, but using the triangle's area won't give you what you need, since the area of each triangle has no relationship with the weight that triangle's normal should carry for a given vertex.
If you base it on the angle between the two sides coming into the vertex, you should get the correct weight for every triangle coming into it. It might be convenient if you could somehow convert things to 2D so you could work off a 360-degree base for your weights, but most likely just using the angle itself as the weight multiplier in 3D space, adding up all the normals produced that way, and normalizing the final result will produce the correct answer.
