Finding a density peak / cluster center in a 2D grid / point process - r

I have a dataset with minute-by-minute GPS coordinates recorded by a person's cellphone, i.e. the dataset has 1440 rows with LON/LAT values. Based on the data I would like a point estimate (lon/lat value) of where the participant's home is. Let's assume that home is the single location where they spend most of their time in a given 24h interval. Furthermore, the GPS sensor has quite high accuracy most of the time, but sometimes it is completely off, resulting in gigantic outliers.
I think the best way to go about this is to treat it as a point process and use 2D density estimation to find the peak. Is there a native way to do this in R? I looked into kde2d (MASS) but this didn't really seem to do the trick. kde2d creates a 25x25 grid of the data range with density values. However, in my data, the person can easily travel 100 miles or more per day, so these cells are generally too coarse for a point estimate. I could narrow the range down and use a much finer grid, but I am sure there must be a better way to get a point estimate.
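(A minimal sketch of pulling a point estimate straight out of kde2d on a much finer grid, assuming lon and lat are the coordinate vectors; note the default bandwidth will be inflated by the gigantic outliers, so it may need to be set explicitly.)
library(MASS)
dens <- kde2d(lon, lat, n = 500)                      # much finer than the default 25x25 grid
i <- which(dens$z == max(dens$z), arr.ind = TRUE)     # grid cell with the highest density
peak <- c(lon = dens$x[i[1, 1]], lat = dens$y[i[1, 2]])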

There are "time spent" functions in the trip package (I'm the author). You can create objects from the track data that understand the underlying track process over time, and simply process the points assuming straight-line segments between fixes. If "home" is the pixel with the largest value, i.e. you break up all the segments by time duration, sum them into cells, and take the maximum, then it's easy to find. A "time spent" grid from the tripGrid function is a SpatialGridDataFrame using the standard sp package classes, and a trip object can be composed of one or many tracks.
Using rgdal you can easily transform coordinates to an appropriate map projection if lon/lat is not appropriate for your extent, but it makes no difference to the grid/time-spent calculation of line segments.
There is a simple speedfilter to remove fixes that imply implausibly fast movement, but it is very simplistic and can introduce new problems; in general, updating or filtering tracks for unlikely movement can be very complicated. (In my experience a basic time-spent gridding gets you as good an estimate as many sophisticated models that just open up new complications.) The filter works with Cartesian or long/lat coordinates, using tools in sp to calculate distances (long/lat is reliable, whereas a poor map projection choice can introduce problems; over short distances like humans on land it's probably no big deal).
(The function tripGrid calculates the exact components of the straight line segments using pixellate.psp, but that detail is hidden in the implementation).
In terms of data preparation, trip is strict about a sensible sequence of times and will prevent you from creating an object if the data have duplicates, are out of order, etc. There is an example of reading data from a text file in ?trip, and a very simple example with (really) dummy data is:
library(trip)
# (really) dummy data: 10 fixes one second apart, all with the same ID
d <- data.frame(x = 1:10, y = rnorm(10), tms = Sys.time() + 1:10, id = gl(1, 5))
coordinates(d) <- ~x+y                  # promote to sp spatial points
tr <- trip(d, c("tms", "id"))           # declare the time and ID columns
g <- tripGrid(tr)                       # SpatialGridDataFrame of time spent per cell
pt <- coordinates(g)[which.max(g$z), ]  # centre of the cell with the most time spent
image(g, col = c("transparent", heat.colors(16)))
lines(tr, col = "black")
points(pt[1], pt[2], pch = "+", cex = 2)
That dummy track has no overlapping regions, but it shows that finding the max point in "time spent" is simple enough.

How about using the location that minimises the sum of squared distances to all the events? This might be close to the location of the supremum of a kernel smoothing, if my brain is working right.
If your data comprise two clusters (home and work) then I think the location will be in the bigger cluster rather than between them. It's not the same as the simple mean of the x and y coordinates.
For an uncertainty on that, jitter your data by whatever your positional uncertainty is (it would be great if you had that value from the GPS; otherwise guess, say 50 metres?) and recompute. Do that 100 times, do a kernel smoothing of those locations and find the 95% contour.
Not rigorous, and I need to experiment with this minimum distance/kernel supremum thing...
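A minimal base R sketch of this idea (not part of the original answer): the minimiser of the sum of squared distances is exactly the coordinate-wise mean, so the sketch below minimises the sum of unsquared distances (the geometric median), which is the variant that settles inside the larger cluster. The toy two-cluster data and the 50 m jitter are assumptions for illustration.
set.seed(1)
# toy stand-in for a projected GPS track (metres): a big "home" cluster and a smaller "work" one
xy <- rbind(cbind(rnorm(900, 0, 30),    rnorm(900, 0, 30)),
            cbind(rnorm(500, 5000, 30), rnorm(500, 2000, 30)))
sumdist  <- function(p, xy) sum(sqrt((xy[, 1] - p[1])^2 + (xy[, 2] - p[2])^2))
fit_home <- function(xy) optim(colMeans(xy), sumdist, xy = xy)$par
home <- fit_home(xy)        # ends up in (or very near) the larger cluster
# crude uncertainty: jitter by the assumed 50 m positional error, refit 100 times,
# then kernel-smooth the refits (e.g. MASS::kde2d) and take the 95% contour
reps <- t(replicate(100, fit_home(xy + matrix(rnorm(length(xy), sd = 50), ncol = 2))))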

In response to spacedman - I am pretty sure least squares won't work. Least squares is best known for bowing to the demands of outliers, without giving much weight to the points that are "nearby". This is the opposite of what is desired.
The bisquare estimator would probably work better, in my opinion - but I have never used it. I think it also requires some tuning.
It behaves more or less like a least-squares estimator up to a certain distance from 0, and then the weighting is constant beyond that. So once a point becomes an outlier, its penalty is constant. We don't want outliers to weigh more and more as we move away from them; we would rather weight them constantly and let the optimization focus on better fitting the points in the vicinity of the cluster.
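A rough base R sketch of that bisquare idea as an iteratively reweighted mean (not from the original answer; the tuning constant k is an assumption and would need to be matched to the GPS noise scale):
# 2D location estimate with Tukey bisquare weights: points beyond distance k get weight 0,
# so far-away outliers stop pulling on the estimate entirely
bisquare_centre <- function(xy, k = 100, iters = 50) {
  centre <- colMeans(xy)                        # start from the ordinary mean
  for (i in seq_len(iters)) {
    d <- sqrt((xy[, 1] - centre[1])^2 + (xy[, 2] - centre[2])^2)
    w <- ifelse(d < k, (1 - (d / k)^2)^2, 0)    # bisquare weight for each point
    centre <- c(weighted.mean(xy[, 1], w), weighted.mean(xy[, 2], w))
  }
  centre
}
# e.g. bisquare_centre(xy, k = 100)   # xy: two-column matrix of projected coordinates in metres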

Related

Spatstat, using the Matérn cluster process to generate homogeneous landscapes, how do I interpret the Ripley K function?

I am looking to develop a point process that ranges from homogeneous, i.e. no correlation between points, to a clustered point process that does have correlation between points. From experimentation I can see that using the Matérn cluster process I can generate landscapes that are clustered.
library(spatstat)
plot(rMatClust(kappa=3,r=0.1,mu=50))
I want to use the simplest code that increases the level of homogeneity, i.e. decreases the dependence of points on each other. I do not want a binary model where the pattern is either homogeneous or not, i.e. just a Poisson process, which can be generated with:
plot(rpoispp(150))
From experimentation I noticed that if I increase the radius of the clusters in the Matérn cluster process, I do seem to create a pseudo-homogeneous pattern.
plot(rMatClust(kappa=3,r=0.3,mu=50))
plot(rMatClust(kappa=3,r=0.7,mu=50))
Is this a good way of generating degrees of homogeneity? I understand that I can use statistical tests to measure the degree of clustering compared to a completely random Poisson process, such as the Ripley K test. For example, if I assign the Matérn cluster process data to variables, such as:
a<-rMatClust(kappa=3,r=0.1,mu=50)
b<-rMatClust(kappa=3,r=0.3,mu=50)
c<-rMatClust(kappa=3,r=0.7,mu=50)
Then use the Ripley K test and plot the results:
plot(Kest(a))
plot(Kest(b))
plot(Kest(c))
I can see that the difference between a homogeneous Poisson process and the clustered point process decreases. I still do not fully understand the significance of the various K values with respect to edge effects and so forth, or how to interpret the Ripley K function, but I think this is the right direction to be heading in? How do I interpret the Ripley K function? Another problem is the number of points: I do not have a consistent number of points in each plot, as can be seen by:
summary(a)
summary(b)
summary(c)
Any knowledgeable feedback on this is greatly appreciated.
The standard terminology is that you want to generate a clustered point pattern.
The function rMatClust generates a clustered point pattern at random, in a two-stage process. The first stage is to generate "parent" points completely at random. The second stage is to generate, for each "parent", a random number of "offspring" points, and to place the "offspring" points inside a circle of radius R around their "parent". The final result is the collection of all "offspring" points. From this description (and help(rMatClust)) you can figure out what happens for different parameter values.
The K function (not the "K test") is a summary of the spacing between points in a point pattern. At a distance r, the value of K(r) is the normalised average number of points observed to fall within distance r of a typical point in the pattern. It is normalised so that it does not depend on the number of points, making it possible to compare patterns with different numbers of points.
When you plot the K function, one of the curves is the theoretical curve that would be expected if the points are completely random, and the other curves are computed from the data point pattern. This allows you to assess whether the point pattern appears to be clustered.
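A small sketch of that comparison in spatstat (an addition, not part of the original answer): simulation envelopes make the judgement against complete spatial randomness easier than eyeballing a single pair of curves.
# reusing the patterns a (r = 0.1) and c (r = 0.7) generated in the question
plot(envelope(a, Kest, nsim = 39))   # clustered: the empirical K should rise well above the envelope
plot(envelope(c, Kest, nsim = 39))   # near-homogeneous: the empirical K should hug the Poisson curve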
I strongly suggest you do some reading in Chapter 7 of the spatstat book. You can download this chapter for free.

segmenting lat/long data graph into lines/vectors

I have lat/lng data from multirotor UAV flights. There are a lot of data points (~13k per flight) and I wish to find line segments in the data; they give me flight speed and direction. I know that most of the flights are guided missions, meaning a waypoint is given to fly to. However, the exact waypoints are unknown to me.
Here is a graph of the lat/lng of a single flight, shifted to near (0,0) so both are visible on the same time-series graph.
I attempted to generate similar data, but there are several constraints, and it may take more time than working on the segmenting directly.
The graphs nearly always start and end at the same point.
Horizontal lines mean the UAV is stationary; these segments are expected.
The beginning and end are always stationary, for takeoff and landing.
There is some noise in the lines from the GPS accuracy, though seemingly not that much.
There are a lot of data points.
The number of segments is unknown.
The noise I could estimate, given the segments, with a least-squares fit to each line. Currently I'm thinking of sampling the data (to decimate it a little) and constructing lines, then merging lines whose angle differs by less than some threshold x (dependent on the noise) and finding the intersection points of the lines that are left.
Another thought is to look at this problem in the frequency domain. The corners should be quite high frequency. Maybe I could make a custom filter kernel that would let me use a window function and win some efficiency.
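One standard technique for exactly this kind of polyline reduction is the Ramer-Douglas-Peucker algorithm (not mentioned above; offered as a hedged sketch): recursively keep the point farthest from the chord between the endpoints until every remaining deviation is below a tolerance eps set a little above the GPS noise. Applied to the (lng, lat) plane it finds the spatial corners; applied to (time, lat) and (time, lng) separately it would also pick up speed changes along straight legs.
# Ramer-Douglas-Peucker on a two-column coordinate matrix; returns the indices of kept vertices
rdp <- function(pts, eps) {
  n <- nrow(pts)
  if (n <= 2) return(seq_len(n))
  a <- pts[1, ]; b <- pts[n, ]; ab <- b - a
  len <- sqrt(sum(ab^2))
  # perpendicular distance of every point from the chord a-b (plain distance if a == b)
  d <- if (len == 0) {
    sqrt((pts[, 1] - a[1])^2 + (pts[, 2] - a[2])^2)
  } else {
    abs(ab[2] * (pts[, 1] - a[1]) - ab[1] * (pts[, 2] - a[2])) / len
  }
  i <- which.max(d)
  if (d[i] <= eps) return(c(1L, n))                     # this whole run is one segment
  left  <- rdp(pts[1:i, , drop = FALSE], eps)
  right <- rdp(pts[i:n, , drop = FALSE], eps)
  unique(c(left, right + i - 1L))
}
# usage sketch: keep <- rdp(cbind(lng, lat), eps = 2e-5)   # eps in degrees (~2 m); project first to use metres
# plot(lng, lat, type = "l", col = "grey"); points(lng[keep], lat[keep], col = "red", pch = 19)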
EDIT: Rewrote the question for more clarity and less rambling.

R: Find a shape from a point cloud

I have a point cloud like the one below
df <- data.frame(x=c(2,3,3,5,6,2,6,7,7,4,3,8,9,10,10,12,11,12,14,15),
y=c(6,5,4,4,4,4,3,3,2,3,7,3,2,3,4,6,5,5,4,6))
plot(df,xlab="",ylab="",pch=20)
Think of them as GPS coordinates of the movement of an animal. I would like to find the spatial area covered by the points (i.e. by the animal). The most obvious solution is a convex hull, which produces this:
df1 <- df[chull(x = df$x, y = df$y), ]
polygon(x = df1$x, y = df1$y)
But this is not the result I am looking for. The movement area is not a closed geometric shape, but rather a boomerang kind of shape. The convex hull covers a lot of area not covered by the animal thereby overestimating the area. I am looking for something like this:
Of course, this is a mock dataset to give an idea. The original datasets have a lot more points and varying geometries in the point cloud. I was thinking along the lines of DBSCAN or minimum spanning trees, but they don't quite work.
I am not sure how to describe this geometrically or mathematically. If anyone has any ideas on how to approach this (even if it's not a full solution), I would very much appreciate that. If anyone has a better title for this question, that would be nice too :-)
Thanks.
Update ----------------------------------------------------------------
Plot of the minimum spanning tree (MST). I think this might be in the right direction.
library(ape)
d <- dist(df)
mstree <- mst(d)
plot(mstree, x1 = df$x, x2 = df$y)
Try alphahull
library(alphahull)
p <- ahull(df$x, df$y, alpha = 2.5)
plot(p)
Still, purely geometric tricks like this are rarely helpful for animal tracking data. They are too ad hoc to carry over to other cases, and they say nothing about the temporal component, the environment, the uncertainty of the locations, or the relationship between the point samples and the real track.
library(geometry)
polyarea(df$x, df$y)
[1] 18.5
This requires the points to be in polygon (boundary) order, though.
You might want to consider an approach based on TSP heuristics. Such approaches are near ideal when all points are relevant.
Below is a simple approach, extended from the insertion heuristics for the TSP, that might be workable, but it's O(N^2) or worse unless you are rather careful with the data structure. The link gives the following description of the convex hull insertion heuristic.
Convex Hull, O(n^2*log^2(n))
Find the convex hull of our set of cities, and make it our initial subtour.
For each city not in the subtour, find its cheapest insertion (as in step 3 of Nearest Insertion). Then choose the city with the least cost/increase ratio, and insert it.
Repeat step 2 until no more cities remain.
In this case, the cities are the data points, and since the goal isn't to connect all of the data points but rather to get the general shape, an extra step is needed to determine when a data point either shouldn't be added or is no longer needed and can be removed. The issue, though, is that it's not clear which points should be considered irrelevant.
This TSP Test Data site should give you an idea of what the results of that heuristic look like, and of how you might go about removing points from the resulting "tour" that you consider irrelevant.
One possible solution is to keep track of the original convex hull and limit the increase in distance between two adjacent hull points to some (relatively small) multiple of the original distance between those hull points, which is similar to how alpha hulls work. This would prevent shapes such as the one at the bottom of TSP Test Case BCL380, by limiting the distance that can be traveled between two hull points.
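A rough base R sketch of that hull-insertion idea (an illustration under assumptions, not a full solution): it starts from the convex hull and repeatedly inserts the point whose cheapest insertion increases the tour length the least (the plain increase rather than the ratio quoted above), and it approximates the open "irrelevant points" question with a crude rule that stops once even the best insertion would stretch an edge by more than some factor. The stretch factor and the rule itself are assumptions, and the naive loops make this slow for large point sets.
# grow a boundary "tour" from the convex hull by cheapest insertion
hull_insertion <- function(xy, stretch = 3) {
  dmat <- as.matrix(dist(xy))
  tour <- chull(xy)                                  # initial subtour: convex hull vertices
  left <- setdiff(seq_len(nrow(xy)), tour)
  while (length(left) > 0) {
    best <- NULL
    for (p in left) {
      e1 <- tour; e2 <- tour[c(seq_along(tour)[-1], 1)]        # edges of the closed tour
      inc <- dmat[e1, p] + dmat[e2, p] - dmat[cbind(e1, e2)]   # length increase per edge
      j <- which.min(inc)
      if (is.null(best) || inc[j] < best$inc)
        best <- list(p = p, pos = j, inc = inc[j], edge = dmat[e1[j], e2[j]])
    }
    # crude relevance rule (an assumption): stop once even the cheapest insertion
    # would stretch its edge by more than `stretch` times
    if (best$inc > stretch * best$edge) break
    tour <- append(tour, best$p, after = best$pos)
    left <- setdiff(left, best$p)
  }
  tour                                               # row indices of xy, in boundary order
}
# b <- hull_insertion(as.matrix(df)); plot(df, pch = 20); polygon(df$x[b], df$y[b], border = "red")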

Maths - clustering of points

Hi all, this is a very simple question, but my mind is a bit blank and I can't seem to find any satisfactory results on the internet.
Given a collection of 2D points (x, y), how can I determine how tightly they are grouped together?
Thanks
I guess an example would have helped: I am trying to measure the "wobble" when aiming at a target, so I have every point the shooter aimed at, and I would like to see whether they were steady or whether they moved a lot.
It depends on your definition of "tight grouping". One possibility is the sample variance, or the corresponding standard deviation. Crudely speaking, this gives you an "average" distance away from the centre point (which can be defined either as a known point, or simply as the average of your dataset).
For a group of 2D points, this can be defined as:
stddev = sqrt(var) = sqrt(1/N * SUM { (x - x0)^2 + (y - y0)^2 })
where (x0,y0) is the sample mean (i.e. the average of all your points).
This metric will be less sensitive to outliers than e.g. the bounding box metric.
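A tiny R sketch of that metric (an illustration; the aim points are assumed to be two numeric vectors x and y):
# RMS distance of the points from their centroid: small = tight grouping, large = lots of wobble
spread <- function(x, y) sqrt(mean((x - mean(x))^2 + (y - mean(y))^2))
# e.g. spread(rnorm(100), rnorm(100))   # roughly sqrt(2) for unit-variance noise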
One simple way to do this is to calculate the bounding box that contains all of the points, compute its area, and then divide that area by the number of points to give an area-per-point value. This could be enough depending on what you need it for, but it can be rather inaccurate.
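A one-line sketch of that bounding-box metric (again assuming numeric vectors x and y):
# bounding-box area divided by the number of points; note that a single outlier inflates it badly
bbox_per_point <- function(x, y) diff(range(x)) * diff(range(y)) / length(x)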

How to resample/rebin a spectrum?

In Matlab, I frequently compute power spectra using Welch's method (pwelch), which I then display on a log-log plot. The frequencies estimated by pwelch are equally spaced, yet logarithmically spaced points would be more appropriate for the log-log plot. In particular, when saving the plot to a PDF file, this results in a huge file size because of the excess of points at high frequency.
What is an effective scheme to resample (rebin) the spectrum, from linearly spaced frequencies to log-spaced frequencies? Or, what is a way to include high-resolution spectra in PDF files without generating excessively large file sizes?
The obvious thing to do is to simply use interp1:
rate = 16384; %# sample rate (samples/sec)
nfft = 16384; %# number of points in the fft
[Pxx, f] = pwelch(detrend(data), hanning(nfft), nfft/2, nfft, rate);
f2 = logspace(log10(f(2)), log10(f(end)), 300);
Pxx2 = interp1(f, Pxx, f2);
loglog(f2, sqrt(Pxx2));
However, this is undesirable because it does not conserve power in the spectrum. For example, if there is a big spectral line between two of the new frequency bins, it will simply be excluded from the resulting log-sampled spectrum.
To fix this, we can instead interpolate the integral of the power spectrum:
df = f(2) - f(1);
intPxx = cumsum(Pxx) * df; % integrate
intPxx2 = interp1(f, intPxx, f2); % interpolate
Pxx2 = diff([0 intPxx2]) ./ diff([0 f2]); % difference back to a density on the new grid
This is cute and mostly works, but the bin centers aren't quite right, and it doesn't intelligently handle the low-frequency region, where the frequency grid may become more finely sampled.
Other ideas:
write a function that determines the new frequency binning and then uses accumarray to do the rebinning.
Apply a smoothing filter to the spectrum before doing interpolation. Problem: the smoothing kernel size would have to be adaptive to the desired logarithmic smoothing.
The pwelch function accepts a frequency-vector argument f, in which case it computes the PSD at the desired frequencies using the Goertzel algorithm. Maybe just calling pwelch with a log-spaced frequency vector in the first place would be adequate. (Is this more or less efficient?)
For the PDF file-size problem: include a bitmap image of the spectrum (seems kludgy--I want nice vector graphics!);
or perhaps display a region (polygon/confidence interval) instead of simply a segmented line to indicate the spectrum.
I would let pwelch do the work for me and give it the frequencies from the start. The doc states that the frequencies you specify will be rounded to the nearest DFT bin. That shouldn't be a problem, since you are using the results to plot. If you are concerned about the runtime, I'd just try it and time it.
If you want to rebin it yourself, I think you're better off writing your own function to do the integration over each of your new bins. If you want to make your life easier, make sure your log-bin boundaries coincide with linear-bin boundaries, so no power has to be split between bins.
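A sketch of that rebinning idea (in R purely for illustration, since the rest of this thread is Matlab; the same logic maps onto accumarray). Power in each linear bin is summed into the log-spaced bin containing it and then divided by the log-bin width, so total power is conserved. The bin count and the geometric bin centres are choices made for this sketch.
# power-conserving rebinning of a one-sided PSD onto log-spaced bins
rebin_log <- function(f, Pxx, nbins = 300) {
  keep <- f > 0                                    # drop the DC bin so log spacing is defined
  f <- f[keep]; Pxx <- Pxx[keep]
  edges <- exp(seq(log(min(f)), log(max(f)), length.out = nbins + 1))
  bin <- findInterval(f, edges, rightmost.closed = TRUE)
  power <- tapply(Pxx * (f[2] - f[1]), bin, sum)   # total power landing in each log bin
  idx <- as.integer(names(power))                  # occupied bins only
  list(f = sqrt(edges[idx] * edges[idx + 1]),      # geometric centre of each occupied bin
       Pxx = as.numeric(power) / diff(edges)[idx]) # back to a density (power per Hz)
}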
Solution found: https://dsp.stackexchange.com/a/2098/64
Briefly, one solution to this problem is to perform Welch's method with a frequency-dependent transform length. The above link is to a dsp.SE answer containing a paper citation and sample implementation. A disadvantage of this technique is that you can't use the FFT, but because the number of DFT points being computed is greatly reduced, this is not a severe problem.
If you want to resample an FFT at a variable (logarithmic) rate, then the smoothing or low-pass filter kernel will need to have a variable width as well to avoid aliasing (loss of sample points). Just use a different-width sinc interpolation kernel for each plot point (with the sinc width approximately the reciprocal of the local sampling rate).
