Control Charts in R
I am trying to create a control chart for metrics that are essentially increasing over time. If I attempt to create a Shewhart chart, there will be many points that are above the upper specification limit.
So for example,
My metric is Revenue. Since it is a fast-growing company, Revenue will keep rising past the upper specification limit over time. The main thing I want to track is when it falls below the lower specification limit.
I know this is very vague but essentially I want to create a control chart that has data increasing over time.
Thanks
Besterfield answers this question in his book Quality Control, sixth edition, where he discusses it as a "Chart for Trends".
[Image: Chart for Trends created in Excel]
The process involves regression to determine the slope of your center line. The equation is $$\overline{X}=a+bG$$ where $\overline{X}$ is the subgroup average, G is the subgroup number, a is the intercept, and b is the slope.
$$a=\frac{(\sum \overline{X})(\sum G^2)-(\sum G)(\sum G\overline{X})}{g\sum G^2-(\sum G)^2}$$
$$b=\frac{g\sum G \overline{X}-(\sum G)(\sum \overline{X})}{g\sum G^2-(\sum G)^2}$$
where g is the number of subgroups.
The coefficients a and b are obtained by setting up columns for G, $\overline{X}$, $G\overline{X}$, and $G^2$; determining their sums; and inserting the sums into the equations.
Once the trend-line equation is known, it can be plotted on the chart by assuming values of G and calculating $\overline{X}$. When two points are plotted, the trend line is drawn between them. The control limits are drawn on each side of the trend line at a distance (in the perpendicular direction) equal to $A_2\overline{R}$ …
The R chart will generally have the typical appearance… However the dispersion may also be increasing.
Besterfield also suggests adding a URL and LRL (Upper Rejection Limit and Lower Rejection Limit) as lines parallel to the horizontal axis, to indicate when the process would be unacceptable.
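A minimal R sketch of this procedure, assuming the subgroup means and ranges are already computed (the data below are made up, lm() is used in place of the hand-calculated sums, and the limits are offset vertically rather than perpendicularly, which is a close approximation for modest slopes):

set.seed(1)
g    <- 20                                  # number of subgroups
G    <- 1:g                                 # subgroup numbers
xbar <- 100 + 2.5 * G + rnorm(g, sd = 2)    # increasing subgroup averages (e.g. revenue)
R    <- runif(g, 3, 6)                      # subgroup ranges
A2   <- 0.577                               # control-chart constant for subgroups of n = 5

fit    <- lm(xbar ~ G)                      # same a and b as the least-squares sums above
centre <- fitted(fit)                       # trend (centre) line
UCL    <- centre + A2 * mean(R)
LCL    <- centre - A2 * mean(R)

plot(G, xbar, type = "b", ylim = range(c(xbar, UCL, LCL)),
     xlab = "Subgroup", ylab = "Subgroup average")
lines(G, centre); lines(G, UCL, lty = 2); lines(G, LCL, lty = 2)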
Dropped by here today; it's an old question that has probably been answered already. If not, try a seasonality analysis and plot your deseasonalised series (trend removed) with a control-chart tool like qicharts2 (in R). If you need more details, just let me know and I can write up a more detailed answer.
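For instance, a rough sketch of that idea (the numbers are made up, and the qic() call reflects my reading of the qicharts2 documentation, so treat it as a starting point rather than a recipe):

library(qicharts2)
rev <- ts(100 + 2.5 * (1:48) + 10 * sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 3),
          frequency = 12)                    # made-up monthly revenue with trend + season
fit <- stl(rev, s.window = "periodic")       # decompose into seasonal, trend, remainder
res <- as.numeric(fit$time.series[, "remainder"])
qic(res, chart = "i")                        # individuals chart of the detrended series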
Sounds like plot() is your game here: time on the X, revenue on the Y, with an abline(h = ) at your specification limit, and you are good to go.
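A minimal illustration with made-up numbers:

time    <- 1:36
revenue <- cumsum(runif(36, 5, 15))          # steadily growing metric
plot(time, revenue, type = "b", xlab = "Time", ylab = "Revenue")
abline(h = 50, col = "red", lty = 2)         # horizontal specification limit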
Related
Spatstat, using the Matérn cluster process to generate homogeneous landscapes, how do I interpret the Ripley K function?
I am looking to develop a point process that ranges from homogeneous, i.e. no correlation between points, to a point cluster process that does have correlation between points. From experimentation I can see that using the Matérn cluster process I can generate landscapes that are clustered.

library(spatstat)
plot(rMatClust(kappa=3,r=0.1,mu=50))

I want to use the simplest code that increases the level of homogeneity, i.e. decreasing dependence of points on each other. I do not want to use a binary model where the pattern is either homogeneous or not, i.e. just a Poisson process, which can be generated with:

plot(rpoispp(150))

From experimentation I noticed that if I increase the radius of the clusters in the Matérn cluster process, I do seem to create a pseudo-homogeneous pattern.

plot(rMatClust(kappa=3,r=0.3,mu=50))
plot(rMatClust(kappa=3,r=0.7,mu=50))

Is this a good way of generating degrees of homogeneity? I understand that I can use statistical tests to measure the degree of clustering compared to a complete Poisson process, such as the Ripley K test. For example, if I assign the Matérn cluster process data to variables:

a<-rMatClust(kappa=3,r=0.1,mu=50)
b<-rMatClust(kappa=3,r=0.3,mu=50)
c<-rMatClust(kappa=3,r=0.7,mu=50)

and then use the Ripley K test and plot the results:

plot(Kest(a))
plot(Kest(b))
plot(Kest(c))

I can see that the difference between a homogeneous Poisson process and the clustered point process decreases. I still do not fully understand the significance of the various K values with respect to edge effects and so forth, and how to interpret the Ripley K function, but I think this is the right direction to be heading in? How do I interpret the Ripley K function? Another problem is the number of points in each plot: I do not have a consistent number of points, as can be seen from:

summary(a)
summary(b)
summary(c)

Any knowledgeable feedback on this is greatly appreciated.
The standard terminology is that you want to generate a clustered point pattern. The function rMatClust generates a clustered point pattern at random, in a two-stage process. The first stage is to generate "parent" points completely at random. The second stage is to generate, for each "parent", a random number of "offspring" points, and to place the "offspring" points inside a circle of radius R around their "parent". The final result is the collection of all "offspring" points. From this description (and help(rMatClust)) you can figure out what happens for different parameter values.

The K function (not the "K test") is a summary of the spacing between points in a point pattern. At a distance r, the value of K(r) is the normalised average number of points observed to fall within distance r of a typical point in the pattern. It is normalised so that it does not depend on the number of points, making it possible to compare patterns with different numbers of points. When you plot the K function, one of the curves is the theoretical curve that would be expected if the points are completely random, and the other curves are computed from the data point pattern. This allows you to assess whether the point pattern appears to be clustered.

I strongly suggest you do some reading in Chapter 7 of the spatstat book. You can download this chapter for free.
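As a short sketch of that comparison (a clustered pattern, its estimated K function, and a simulation envelope under complete spatial randomness):

library(spatstat)
X <- rMatClust(3, 0.1, 50)         # kappa, cluster radius, mu, as in the question
plot(Kest(X))                      # empirical estimates against the theoretical Poisson K
E <- envelope(X, Kest, nsim = 39)  # pointwise envelope from 39 CSR simulations
plot(E)                            # a data curve above the envelope suggests clustering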
segmenting lat/long data graph into lines/vectors
I have lat/lng data from multirotor UAV flights. There are a lot of data points (~13k per flight) and I wish to find line segments in the data; they give me flight speed and direction. I know that most of the flights are guided missions, meaning a point is given to fly to; however, the exact points are unknown to me. Here is a graph of a single flight's lat/lng, shifted to near (0,0) so both are visible on the same time-series graph. I attempted to generate similar data, but there are several constraints and it may take more time to solve than working on the segmenting.

The graphs nearly always start and end at the same point. Horizontal lines mean the UAV is stationary; these segments are expected. The beginning and end are always stationary for takeoff and landing. There is some level of noise in the lines from the GPS accuracy, though seemingly not that much. There are a lot of data points, and the number of segments is unknown. The noise I could calculate given the segments, using a least-squares fit to each line.

Currently I'm thinking of sampling the data (to decimate it a little) and constructing lines, merging lines whose angle differs by less than some threshold x (dependent on the noise), and finding the intersection points of the lines that are left. Another thought is to look at this problem in the frequency domain: the corners should be quite high frequency. Maybe I could make a custom filter kernel that would enable me to use a window function and gain efficiency.

EDIT: Rewrote the question for more clarity and less rambling.
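A rough base-R sketch of the "constructing lines and merging" idea (entirely made-up track; a plain Ramer-Douglas-Peucker simplification stands in for the merge step, and the tolerance eps should reflect the GPS noise):

rdp <- function(pts, eps) {
  # pts: two-column matrix of (lon, lat); returns indices of the kept "corner" points
  n <- nrow(pts)
  if (n < 3) return(c(1, n))
  a <- pts[1, ]; b <- pts[n, ]
  ab <- b - a
  # perpendicular distance of every point to the chord a-b
  d <- abs(ab[2] * (pts[, 1] - a[1]) - ab[1] * (pts[, 2] - a[2])) / sqrt(sum(ab^2))
  i <- which.max(d)
  if (d[i] > eps) {
    left  <- rdp(pts[1:i, , drop = FALSE], eps)
    right <- rdp(pts[i:n, , drop = FALSE], eps)
    unique(c(left, right + i - 1))
  } else {
    c(1, n)
  }
}

# made-up track: three straight legs plus noise
track <- rbind(cbind(seq(0, 1, length = 100), 0),
               cbind(1, seq(0, 1, length = 100)),
               cbind(seq(1, 0, length = 100), 1)) + rnorm(600, sd = 0.005)
keep <- rdp(track, eps = 0.02)
plot(track, pch = ".", asp = 1)
lines(track[keep, ], col = "red", lwd = 2)   # detected segments and their corner points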
Adding plotstick-like arrows to a scatterplot
This is my first post here, though I have read a lot of your Q&A these last 6 months. I'm currently working on ADCP (Acoustic Doppler Current Profiler) data, handled by the "oce" package from Dan Kelley (a little bit of advertising for those who want to deal with oceanographic data in R). I'm not very experienced in R, and I have read the question about abline for levelplot functions, "How to add lines to a levelplot made using lattice (abline somehow not working)?".

What I currently have is a levelplot representing a time series of echo-intensity data (from the backscattered signal, which is monitored at the same time as the current), taken over 10 m of depth; this 10 m depth line is divided into 25 rows, and a measurement is made in each row along the line (see the code part to obtain an image of what I have; unfortunately, my reputation doesn't allow me to post images).

I then generate another plot, which represents arrows of the current direction: the length of each arrow indicates the current strength, and its orientation is shown (all of this is done by taking the two components of the current intensity (East-West / North-South) and representing the resulting current). There is an arrow drawn for each tick of time (thus, for the 1000 columns of my example data, there are always two components of the current intensity). Those arrows are drawn at the beginning of each measurement cell, thus at each row of my data, giving a representation of currents for the whole water column. You can see the code part to have an "as I have it" representation of the currents.

The purpose of this question is to understand how I can superimpose those two representations, drawing my current arrows at each row of the represented data, thus showing current direction, intensity, and echo intensity together. I can't find a link that shows exactly what I mean, but this is something I have already seen. I tried with the panel function, which seems to be the best option, but my knowledge of R and of handling this kind of work is small, and I hope one of you may have the time and the knowledge to help me solve this problem far faster than I could. I am, of course, available to answer any questions or give precisions. I may ask a lot more; after working on a large code base for 6 months, my thirst for learning is now large.
Code to represent data : Here are some data to represent what I have: U (north/south component of velocity) and V (East/west): U1= c(0.043,0.042,0.043,0.026,0.066,-0.017,-0.014,-0.019,0.024,-0.007,0.000,-0.048,-0.057,-0.101,-0.063,-0.114,-0.132,-0.103,-0.080,-0.098,-0.123,-0.087,-0.071,-0.050,-0.095,-0.047,-0.031,-0.028,-0.015,0.014,-0.019,0.048,0.026,0.039,0.084,0.036,0.071,0.055,0.019,0.059,0.038,0.040,0.013,0.044,0.078,0.040,0.098,0.015,-0.009,0.013,0.038,0.013,0.039,-0.008,0.024,-0.004,0.046,-0.004,-0.079,-0.032,-0.023,-0.015,-0.001,-0.028,-0.030,-0.054,-0.071,-0.046,-0.029,0.012,0.016,0.049,-0.020,0.012,0.016,-0.021,0.017,0.013,-0.008,0.057,0.028,0.056,0.114,0.073,0.078,0.133,0.056,0.057,0.096,0.061,0.096,0.081,0.100,0.092,0.057,0.028,0.055,0.025,0.082,0.087,0.070,-0.010,0.024,-0.025,0.018,0.016,0.007,0.020,-0.031,-0.045,-0.009,-0.060,-0.074,-0.072,-0.082,-0.100,-0.047,-0.089,-0.074,-0.070,-0.070,-0.070,-0.075,-0.070,-0.055,-0.078,-0.039,-0.050,-0.049,0.024,-0.026,-0.021,0.008,-0.026,-0.018,0.002,-0.009,-0.025,0.029,-0.040,-0.006,0.055,0.018,-0.035,-0.011,-0.026,-0.014,-0.006,-0.021,-0.031,-0.030,-0.056,-0.034,-0.026,-0.041,-0.107,-0.069,-0.082,-0.091,-0.096,-0.043,-0.038,-0.056,-0.068,-0.064,-0.042,-0.064,-0.058,0.016,-0.041,0.018,-0.008,0.058,0.006,0.007,0.060,0.011,0.050,-0.028,0.023,0.015,0.083,0.106,0.057,0.096,0.055,0.119,0.145,0.078,0.090,0.110,0.087,0.098,0.092,0.050,0.068,0.042,0.059,0.030,-0.005,-0.005,-0.013,-0.013,-0.016,0.008,-0.045,-0.021,-0.036,0.020,-0.018,-0.032,-0.038,0.021,-0.077,0.003,-0.010,-0.001,-0.024,-0.020,-0.022,-0.029,-0.053,-0.022,-0.007,-0.073,0.013,0.018,0.002,-0.038,0.024,0.025,0.033,0.008,0.016,-0.018,0.023,-0.001,-0.010,0.006,0.053,0.004,0.001,-0.003,0.009,0.019,0.024,0.031,0.024,0.009,-0.009,-0.035,-0.030,-0.031,-0.094,-0.006,-0.052,-0.061,-0.104,-0.098,-0.054,-0.161,-0.110,-0.078,-0.178,-0.052,-0.073,-0.051,-0.065,-0.029,-0.012,-0.053,-0.070,-0.040,-0.056,-0.004,-0.032,-0.065,-0.005,0.036,0.023,0.043,0.078,0.039,0.019,0.061,0.025,0.036,0.036,0.062,0.048,0.073,0.037,0.025,0.000,-0.007,-0.014,-0.050,-0.014,0.007,-0.035,-0.115,-0.039,-0.113,-0.102,-0.109,-0.158,-0.158,-0.133,-0.110,-0.170,-0.124,-0.115,-0.134,-0.097,-0.106,-0.155,-0.168,-0.038,-0.040,-0.074,-0.011,-0.040,-0.003,-0.019,-0.022,-0.006,-0.049,-0.048,-0.039,-0.011,-0.036,-0.001,-0.018,-0.037,-0.001,0.033,0.061,0.054,0.005,0.040,0.045,0.062,0.016,-0.007,-0.005,0.009,0.044,0.029,-0.016,-0.028,-0.021,-0.036,-0.072,-0.138,-0.060,-0.109,-0.064,-0.142,-0.081,-0.032,-0.077,-0.058,-0.035,-0.039,-0.013,0.007,0.007,-0.052,0.024,0.018,0.067,0.015,-0.002,-0.004,0.038,-0.010,0.056) 
V1=c(-0.083,-0.089,-0.042,-0.071,-0.043,-0.026,0.025,0.059,-0.019,0.107,0.049,0.089,0.094,0.090,0.120,0.169,0.173,0.159,0.141,0.157,0.115,0.128,0.154,0.083,0.038,0.081,0.129,0.120,0.112,0.074,0.022,-0.022,-0.028,-0.048,-0.027,-0.056,-0.027,-0.107,-0.020,-0.063,-0.069,-0.019,-0.055,-0.071,-0.027,-0.034,-0.018,-0.089,-0.068,-0.129,-0.034,-0.002,0.011,-0.009,-0.038,-0.013,-0.006,0.027,0.037,0.022,0.087,0.080,0.119,0.085,0.076,0.072,0.029,0.103,0.019,0.020,0.052,0.024,-0.051,-0.024,-0.008,0.011,-0.019,0.023,-0.011,-0.033,-0.101,-0.157,-0.094,-0.099,-0.106,-0.103,-0.139,-0.093,-0.098,-0.083,-0.118,-0.142,-0.155,-0.095,-0.122,-0.072,-0.034,-0.047,-0.036,0.014,0.035,-0.034,-0.012,0.054,0.030,0.060,0.091,0.013,0.049,0.083,0.070,0.127,0.048,0.118,0.123,0.099,0.097,0.074,0.125,0.051,0.107,0.069,0.040,0.102,0.100,0.119,0.087,0.077,0.044,0.091,0.020,0.010,-0.028,0.026,-0.018,-0.020,0.010,0.034,0.005,0.010,0.028,-0.043,0.025,-0.069,-0.003,0.004,-0.001,0.024,0.032,0.076,0.033,0.071,0.000,0.052,0.034,0.058,0.002,0.070,0.025,0.056,0.051,0.080,0.051,0.101,0.009,0.052,0.079,0.035,0.051,0.049,0.064,0.004,0.011,0.005,0.031,-0.021,-0.024,-0.048,-0.011,-0.072,-0.034,-0.020,-0.052,-0.069,-0.088,-0.093,-0.084,-0.143,-0.103,-0.110,-0.124,-0.175,-0.083,-0.117,-0.090,-0.090,-0.040,-0.068,-0.082,-0.082,-0.061,-0.013,-0.029,-0.032,-0.046,-0.031,-0.048,-0.028,-0.034,-0.012,0.006,-0.062,-0.043,0.010,0.036,0.050,0.030,0.084,0.027,0.074,0.082,0.087,0.079,0.031,0.003,0.001,0.038,0.002,-0.038,0.003,0.023,-0.011,0.013,0.003,-0.046,-0.021,-0.050,-0.063,-0.068,-0.085,-0.051,-0.052,-0.065,0.014,-0.016,-0.082,-0.026,-0.032,0.019,-0.026,0.036,-0.005,0.092,0.070,0.045,0.074,0.091,0.122,-0.007,0.094,0.064,0.087,0.063,0.083,0.109,0.062,0.096,0.036,-0.019,0.075,0.052,0.025,0.031,0.078,0.044,-0.018,-0.040,-0.039,-0.140,-0.037,-0.095,-0.056,-0.044,-0.039,-0.086,-0.062,-0.085,-0.023,-0.103,-0.035,-0.067,-0.096,-0.097,-0.060,0.003,-0.051,0.014,-0.002,0.054,0.045,0.073,0.080,0.096,0.104,0.126,0.144,0.136,0.132,0.160,0.155,0.136,0.080,0.144,0.087,0.093,0.103,0.151,0.165,0.146,0.159,0.156,0.002,0.023,-0.019,0.078,0.031,0.038,0.019,0.094,0.018,0.028,0.064,-0.052,-0.034,0.000,-0.074,-0.076,-0.028,-0.048,-0.025,-0.095,-0.098,-0.045,-0.016,-0.030,-0.036,-0.012,0.023,0.038,0.042,0.039,0.073,0.066,0.027,0.016,0.093,0.129,0.138,0.121,0.077,0.046,0.067,0.068,0.023,0.062,0.038,-0.007,0.055,0.006,-0.015,0.008,0.064,0.012,0.004,-0.055,0.018,0.042) 
U2=c(0.022,0.005,-0.022,0.025,-0.014,-0.020,-0.001,-0.021,-0.008,-0.006,-0.056,0.050,-0.068,0.018,-0.106,-0.053,-0.084,-0.082,-0.061,-0.041,-0.057,-0.123,-0.060,-0.029,-0.084,-0.004,0.030,-0.021,-0.036,-0.016,0.006,0.088,0.088,0.079,0.063,0.097,0.020,-0.048,0.046,0.057,0.065,0.042,0.022,0.016,0.041,0.109,0.024,-0.010,-0.084,-0.002,0.004,-0.033,-0.025,-0.020,-0.061,-0.060,-0.043,-0.027,-0.054,-0.054,-0.040,-0.077,-0.043,-0.014,0.030,-0.051,0.001,-0.029,0.008,-0.023,0.015,0.002,-0.001,0.029,0.048,0.081,-0.022,0.040,0.018,0.131,0.059,0.055,0.043,0.027,0.091,0.104,0.101,0.084,0.048,0.057,0.044,0.083,0.063,0.083,0.079,0.042,-0.021,0.017,0.005,0.001,-0.033,0.010,-0.028,-0.035,-0.012,-0.034,-0.055,-0.009,0.001,-0.084,-0.047,-0.020,-0.046,-0.042,-0.058,-0.071,0.013,-0.045,-0.070,0.000,-0.067,-0.090,0.012,-0.013,-0.013,-0.009,-0.063,-0.047,-0.030,0.046,0.026,0.019,0.007,-0.056,-0.062,0.009,-0.019,-0.005,0.003,0.022,-0.006,-0.019,0.020,0.025,0.040,-0.032,0.015,0.019,-0.014,-0.031,-0.047,0.010,-0.058,-0.079,-0.052,-0.044,0.012,-0.039,-0.007,-0.068,-0.095,-0.053,-0.066,-0.056,-0.033,-0.006,0.001,0.010,0.004,0.011,0.013,0.029,-0.011,0.007,0.023,0.087,0.054,0.040,0.013,-0.006,0.076,0.086,0.103,0.121,0.070,0.074,0.067,0.045,0.088,0.041,0.075,0.039,0.043,0.016,0.065,0.056,0.047,-0.002,-0.001,-0.009,-0.029,0.018,0.041,0.002,-0.022,0.003,0.008,0.031,0.003,-0.031,-0.015,0.014,-0.057,-0.043,-0.045,-0.067,-0.040,-0.013,-0.111,-0.067,-0.055,-0.004,-0.070,-0.019,0.009,0.009,0.032,-0.021,0.023,0.123,-0.032,0.040,0.012,0.042,0.038,0.037,-0.007,0.003,0.011,0.090,0.039,0.083,0.023,0.056,0.030,0.042,0.030,-0.046,-0.034,-0.021,-0.076,-0.017,-0.071,-0.053,-0.014,-0.060,-0.038,-0.076,-0.011,-0.005,-0.051,-0.043,-0.032,-0.014,-0.038,-0.081,-0.021,-0.035,0.014,-0.001,0.001,0.003,-0.029,-0.031,0.000,0.048,-0.036,0.034,0.054,0.001,0.046,0.006,0.039,0.015,0.012,0.034,0.022,0.015,0.033,0.037,0.012,0.057,0.001,-0.014,0.012,-0.007,-0.022,-0.002,-0.008,0.043,-0.041,-0.057,-0.006,-0.079,-0.070,-0.038,-0.040,-0.073,-0.045,-0.101,-0.092,-0.046,-0.047,-0.023,-0.028,-0.019,-0.086,-0.047,-0.038,-0.068,-0.017,0.037,-0.010,-0.016,0.010,-0.005,-0.031,0.004,-0.034,0.005,0.006,-0.015,0.017,-0.043,-0.007,-0.009,0.013,0.026,-0.036,0.011,0.047,-0.025,-0.023,0.043,-0.020,-0.003,-0.043,0.000,-0.018,-0.075,-0.045,-0.063,-0.043,-0.055,0.007,-0.063,-0.085,-0.031,0.005,-0.067,-0.059,-0.059,-0.029,-0.014,-0.040,-0.072,-0.018,0.039,-0.006,-0.001,-0.015,0.038,0.038,-0.009,0.026,0.017,0.056) 
V2=c(-0.014,0.001,0.004,-0.002,0.022,0.019,0.023,-0.023,0.030,-0.085,-0.007,-0.027,0.100,0.058,0.108,0.055,0.132,0.115,0.084,0.046,0.102,0.121,0.036,0.019,0.066,0.049,-0.011,0.020,0.023,0.011,0.041,0.009,-0.009,-0.023,-0.036,0.031,0.012,0.026,-0.011,0.009,-0.027,-0.033,-0.054,-0.004,-0.040,-0.048,-0.009,0.023,-0.028,0.022,0.090,0.060,0.040,0.003,-0.011,0.030,0.107,0.025,0.084,0.036,0.074,0.065,0.078,0.011,0.058,0.092,0.083,0.080,0.039,0.000,-0.027,0.035,0.011,0.004,0.023,-0.033,-0.060,-0.049,-0.101,-0.033,-0.105,-0.042,-0.088,-0.086,-0.093,-0.085,-0.028,-0.046,-0.045,-0.052,-0.009,-0.066,-0.073,-0.067,0.011,-0.057,-0.087,-0.066,-0.103,-0.075,0.003,-0.021,0.010,-0.013,0.021,0.020,0.084,0.028,0.127,0.050,0.104,0.097,0.075,0.021,0.057,0.095,0.080,0.077,0.086,0.110,0.054,0.016,0.105,0.065,0.046,0.047,0.072,0.058,0.092,0.063,0.033,0.087,0.036,0.049,0.093,0.008,0.064,0.068,0.040,0.049,0.035,0.042,0.045,0.021,0.056,0.007,0.026,0.067,0.046,0.088,0.084,0.070,0.037,0.079,0.065,0.074,0.077,0.023,0.094,0.061,0.096,0.068,0.067,0.091,0.061,0.069,0.090,0.046,0.057,0.011,-0.018,0.005,0.001,-0.023,-0.087,0.010,0.023,-0.025,-0.040,-0.059,-0.063,-0.075,-0.136,-0.078,-0.102,-0.128,-0.116,-0.091,-0.136,-0.083,-0.115,-0.063,-0.055,-0.080,-0.093,-0.099,-0.053,-0.042,-0.011,-0.034,-0.027,-0.042,-0.022,-0.008,-0.033,-0.039,-0.036,0.019,0.036,-0.002,0.000,-0.021,0.060,0.030,0.073,0.080,0.061,0.046,0.062,0.010,0.034,0.103,0.107,0.016,0.080,0.067,0.007,0.060,0.021,-0.026,0.008,0.051,0.030,0.001,-0.036,-0.047,0.000,0.006,0.006,0.013,0.009,0.019,0.009,-0.086,-0.020,0.018,0.039,0.014,0.011,0.052,0.031,0.095,0.047,0.065,0.114,0.086,0.102,0.037,0.039,0.060,0.024,0.091,0.058,0.065,0.060,0.045,0.031,0.062,0.047,0.043,0.057,0.032,0.057,0.051,0.019,0.056,0.024,-0.003,0.023,-0.013,-0.032,-0.022,-0.064,-0.021,-0.050,-0.063,-0.090,-0.082,-0.076,-0.077,-0.042,-0.060,-0.010,-0.060,-0.069,-0.028,-0.071,-0.046,-0.020,-0.074,0.080,0.071,0.065,0.079,0.065,0.039,0.061,0.154,0.072,0.067,0.133,0.106,0.080,0.047,0.053,0.110,0.080,0.122,0.075,0.052,0.034,0.081,0.118,0.079,0.101,0.053,0.082,0.036,0.033,0.026,0.002,-0.002,0.020,0.087,0.021,0.034,0.003,-0.021,0.016,-0.009,-0.045,-0.043,-0.020,0.027,0.008,-0.006,0.043,0.045,0.014,0.053,0.083,0.113,0.091,0.028,0.060,0.040,0.019,0.114,0.126,0.090,0.046,0.089,0.029,0.030,0.010,0.045,0.040,0.072,-0.033,-0.008,0.014,-0.018,-0.004,-0.037,0.015,-0.021,-0.015) bindistances=c(1.37,1.62,1.87,2.12,2.37,2.62,2.87,3.12,3.37,3.62,3.87,4.12,4.37,4.62,4.87,5.12,5.37,5.62,5.87,6.12,6.37,6.62,6.87,7.12,7.37,7.62,7.87,8.12) Then, as a representation of currents: AA=14 x11() par(mfrow=c(4,1)) plotSticks(x=seq(from=(1), to=(377), by=(1)), u=U1, v=V1, yscale=ysc,xlab='',ylab='',xaxt='n',yaxt='n',col=(rep('black',384))) axis(side=1) plotSticks(x=seq(from=(1), to=(377), by=(1)), u=U2, v=V2, yscale=ysc,xlab='',ylab='',xaxt='n',yaxt='n',col=(rep('black',384))) plotSticks(x=seq(from=(1), to=(377), by=(1)), u=U2, v=V2, yscale=ysc,xlab='',ylab='',xaxt='n',yaxt='n',col=(rep('black',384))) plotSticks(x=seq(from=(1), to=(377), by=(1)), u=U2, v=V2, yscale=ysc,xlab='',ylab='',xaxt='n',yaxt='n',col=(rep('black',384))) In order to simplify the representation, the three last plots are based on the same data.
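(Note that the snippet above sets AA = 14 but passes yscale = ysc, so ysc needs to be defined, presumably ysc <- 14, before it will run.) Below is a hedged sketch of the panel-function idea, not using oce's own plotting: draw the echo-intensity matrix with lattice::levelplot and add one arrow per (time, bin) cell inside a custom panel. All data here are made up, and scale plays the role of plotSticks' yscale:

library(lattice)
nt <- 50; nz <- 10
echo <- matrix(rnorm(nt * nz), nt, nz)             # time x bin echo intensity
u    <- matrix(rnorm(nt * nz, sd = 0.05), nt, nz)  # east-west velocity component
v    <- matrix(rnorm(nt * nz, sd = 0.05), nt, nz)  # north-south velocity component
scale <- 5                                         # arrow length scaling
levelplot(echo, xlab = "Time index", ylab = "Bin",
          panel = function(x, y, z, ...) {
            panel.levelplot(x, y, z, ...)
            xs <- rep(1:nt, nz); ys <- rep(1:nz, each = nt)
            panel.arrows(xs, ys, xs + scale * as.vector(u), ys + scale * as.vector(v),
                         length = 0.03, col = "black")
          })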
Finding a density peak / cluster centrum in 2D grid / point process
I have a dataset with minute-by-minute GPS coordinates recorded by a person's cellphone, i.e. the dataset has 1440 rows with LON/LAT values. Based on the data I would like a point estimate (lon/lat value) of where the participant's home is. Let's assume that home is the single location where they spend most of their time in a given 24 h interval. Furthermore, the GPS sensor most of the time has quite high accuracy; however, sometimes it is completely off, resulting in gigantic outliers. I think the best way to go about this is to treat it as a point process and use 2D density estimation to find the peak. Is there a native way to do this in R? I looked into kde2d (MASS) but this didn't really seem to do the trick. kde2d creates a 25x25 grid of the data range with density values. However, in my data the person can easily travel 100 miles or more per day, so these blocks are generally too coarse an estimate. I could narrow them down and use a much finer grid, but I am sure there must be a better way to get a point estimate.
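For what it's worth, the "finer grid" idea is easy to try with kde2d by raising n and then taking the mode of the estimated density (the coordinates below are made up: roughly 1440 fixes, mostly near "home", plus some travel and outliers):

library(MASS)
lon <- c(rnorm(1200, -0.1276, 0.0005), runif(240, -1, 1))
lat <- c(rnorm(1200, 51.5072, 0.0005), runif(240, 51, 52))
dens <- kde2d(lon, lat, n = 500)             # 500 x 500 grid instead of the default 25 x 25
i    <- arrayInd(which.max(dens$z), dim(dens$z))
home <- c(dens$x[i[1]], dens$y[i[2]])        # point estimate of the density peak
home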
There are "time spent" functions in the trip package (I'm the author). You can create objects from the track data that understand the underlying track process over time, and simply process the points assuming straight line segments between fixes. If "home" is where the largest value pixel is, i.e. when you break up all the segments based on the time duration and sum them into cells, then it's easy to find it. A "time spent" grid from the tripGrid function is a SpatialGridDataFrame with the standard sp package classes, and a trip object can be composed of one or many tracks. Using rgdal you can easily transform coordinates to an appropriate map projection if lon/lat is not appropriate for your extent, but it makes no difference to the grid/time-spent calculation of line segments. There is a simple speedfilter to remove fixes that imply movement that is too fast, but that is very simplistic and can introduce new problems, in general updating or filtering tracks for unlikely movement can be very complicated. (In my experience a basic time spent gridding gets you as good an estimate as many sophisticated models that just open up new complications). The filter works with Cartesian or long/lat coordinates, using tools in sp to calculate distances (long/lat is reliable, whereas a poor map projection choice can introduce problems - over short distances like humans on land it's probably no big deal). (The function tripGrid calculates the exact components of the straight line segments using pixellate.psp, but that detail is hidden in the implementation). In terms of data preparation, trip is strict about a sensible sequence of times and will prevent you from creating an object if the data have duplicates, are out of order, etc. There is an example of reading data from a text file in ?trip, and a very simple example with (really) dummy data is: library(trip) d <- data.frame(x = 1:10, y = rnorm(10), tms = Sys.time() + 1:10, id = gl(1, 5)) coordinates(d) <- ~x+y tr <- trip(d, c("tms", "id")) g <- tripGrid(tr) pt <- coordinates(g)[which.max(g$z), ] image(g, col = c("transparent", heat.colors(16))) lines(tr, col = "black") points(pt[1], pt[2], pch = "+", cex = 2) That dummy track has no overlapping regions, but it shows that finding the max point in "time spent" is simple enough.
How about using the location that minimises the sum of squared distances to all the events? This might be close to the supremum of any kernel smoothing, if my brain is working right. If your data comprise two clusters (home and work) then I think the location will be in the biggest cluster rather than between them. It's not the same as the simple mean of the x and y coordinates. For an uncertainty on that, jitter your data by whatever your positional uncertainty is (it would be great if you had that value from the GPS; otherwise guess - 50 metres?) and recompute. Do that 100 times, do a kernel smoothing of those locations, and find the 95% contour. Not rigorous, and I need to experiment with this minimum-distance/kernel-supremum thing...
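A sketch of this proposal, including the jitter step, with stand-in data and the guessed 50 m positional noise expressed as roughly 0.0005 degrees:

lon <- rnorm(1440, -0.1276, 0.0005)          # stand-in fixes; use your own lon/lat
lat <- rnorm(1440, 51.5072, 0.0005)
sumsq <- function(p, lon, lat) sum((lon - p[1])^2 + (lat - p[2])^2)
est   <- function(lon, lat) optim(c(mean(lon), mean(lat)), sumsq, lon = lon, lat = lat)$par
home <- est(lon, lat)
jit  <- t(replicate(100, est(lon + rnorm(length(lon), sd = 0.0005),
                             lat + rnorm(length(lat), sd = 0.0005))))
plot(jit, xlab = "lon", ylab = "lat")        # spread of the 100 jittered estimates
points(home[1], home[2], pch = 3, cex = 2)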
In response to spacedman: I am pretty sure least squares won't work. Least squares is best known for bowing to the demands of outliers, without giving much weight to things that are "nearby". This is the opposite of what is desired. The bisquare estimator would probably work better, in my opinion, but I have never used it. I think it also requires some tuning. It is more or less like a least-squares estimator up to a certain distance from 0, and then the weighting is constant beyond that. So once a point becomes an outlier, its penalty is constant. We don't want outliers to weigh more and more as we move away from them; we would rather give them constant weight and let the optimisation focus on better fitting the things in the vicinity of the cluster.
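A small sketch of the bisquare idea, treating it as an M-estimate of location for each coordinate separately (MASS::rlm with a Tukey bisquare psi; the fixes below are stand-ins with some gross outliers added):

library(MASS)
lon <- c(rnorm(1400, -0.1276, 0.0005), runif(40, -1, 1))
lat <- c(rnorm(1400, 51.5072, 0.0005), runif(40, 51, 52))
home <- c(coef(rlm(lon ~ 1, psi = psi.bisquare)),
          coef(rlm(lat ~ 1, psi = psi.bisquare)))
home            # largely unaffected by the outliers, unlike c(mean(lon), mean(lat))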
Minimising interpolation error between two data sets
In the top of the diagrams below we can see some value (y-axis) changing over time (x-axis). As this happens, we sample the value at different and unpredictable times, and we also alternate the sampling between two data sets, indicated by red and blue. When computing the value at any time, we expect that both the red and blue data sets will return similar values. However, as shown in the three smaller boxes, this is not the case: viewed over time, the values from each data set (red and blue) appear to diverge and then converge about the original value. Initially I used linear interpolation to obtain a value; next I tried Catmull-Rom interpolation. The former results in values that come close together and then drift apart between each pair of data points; the latter results in values which remain closer, but where the average error is greater. Can anyone suggest another strategy or interpolation method which will provide greater smoothing (perhaps by using a greater number of sample points from each data set)?
I believe what you are asking is a question that does not have a straight answer without further knowledge of the underlying sampled process. By its nature, the value of the function between samples can be almost anything, so I think there is no way to guarantee convergence of the interpolations of two sample arrays. That said, if you have prior knowledge of the underlying process, then you can choose among several interpolation methods to minimize the errors. For example, if you measure the drag force as a function of the wing velocity, you know the relation is quadratic (a*V^2). Then you can choose a polynomial fit of 2nd order and get a pretty good match between the interpolations of the two series.
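To illustrate the drag example with a 2nd-order fit (made-up data):

V    <- seq(1, 20, by = 0.5)
drag <- 0.4 * V^2 + rnorm(length(V), sd = 2)       # underlying square law plus noise
fit  <- lm(drag ~ poly(V, 2, raw = TRUE))          # polynomial fit of order 2
plot(V, drag)
lines(V, fitted(fit), col = "red")                 # follows the known physics between samples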
Try B-splines: Catmull-Rom interpolates (goes through the data points), B-spline does smoothing. For example, for uniformly spaced data (not your case):

Bspline(t) = (data(t-1) + 4*data(t) + data(t+1)) / 6

Of course the interpolated red / blue curves depend on the spacing of the red / blue data points, so cannot match perfectly.
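For uniformly spaced samples this smoothing rule is just a weighted moving average, e.g. in R:

y  <- cumsum(rnorm(50))                 # some uniformly spaced samples
ys <- stats::filter(y, c(1, 4, 1) / 6)  # (y[t-1] + 4*y[t] + y[t+1]) / 6
plot(y, type = "l")
lines(as.numeric(ys), col = "blue")     # smoothed curve (NA at the two ends)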
I'd like to quote Introduction to Catmull-Rom Splines to suggest not using Catmull-Rom for this interpolation task. One of the features of the Catmull-Rom spline is that the specified curve will pass through all of the control points - this is not true of all types of splines. By definition your red interpolated curve will pass through all red data points and your blue interpolated curve will pass through all blue points. Therefore you won't get a best fit for both data sets. You might change your boundary conditions and use data points from both data sets for a piecewise approximation as shown in these slides.
I agree with ysap that this question cannot be answered as you may be expecting. There may be better interpolation methods, depending on your model dynamics; as with ysap, I recommend methods that utilise the underlying dynamics, if known. Regarding the red/blue samples, I think you have made a good observation about sampled and interpolated data sets, and I would challenge your original expectation that: "When computing the value at any time, we expect that both red and blue data sets will return similar values." I do not expect this. If you assume that you cannot interpolate perfectly - and particularly if the interpolation error is large compared to the errors in the samples - then you are certain to have a continuous error function whose largest errors occur furthest (in time) from your sample points. Therefore two data sets with differing sample points should exhibit the behaviour you see, because points that are far (in time) from red sample points may be near (in time) to blue sample points and vice versa; with points staggered as yours are, this is sure to be true. Thus I would expect what you show: "Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value." (If you do not have information about the underlying dynamics (other than frequency content), then Giacomo's points on sampling are key; however, you need not interpolate if you are only looking at information below the Nyquist frequency.)
When sampling the original continuous function, the sampling frequency should comply with the Nyquist-Shannon sampling theorem; otherwise the sampling process introduces an error (known as aliasing). This error, being different in the two datasets, results in different values when you interpolate. Therefore, you need to know the highest frequency B of the original function and then collect samples at a frequency of at least 2B. If your function has very high frequencies and you cannot sample that fast, you should at least try to filter them out before sampling.
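A tiny illustration of the aliasing point: a 9 Hz cosine sampled at 10 Hz (below the Nyquist rate of 18 Hz) produces exactly the same samples as a 1 Hz cosine, so the two are indistinguishable after sampling:

t_fine   <- seq(0, 1, by = 0.001)
t_coarse <- seq(0, 1, by = 0.1)                        # 10 Hz sampling
plot(t_fine, cos(2 * pi * 9 * t_fine), type = "l", col = "grey")
points(t_coarse, cos(2 * pi * 9 * t_coarse), pch = 16)
lines(t_fine, cos(2 * pi * 1 * t_fine), col = "red")   # the 1 Hz alias hits the same samples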