Minimising interpolation error between two data sets - math

In the top of the diagrams below we can see some value (y-axis) changing over time (x-axis).
As this happens we sample the value at different, unpredictable times, alternating the samples between two data sets, indicated by red and blue.
When computing the value at any time, we expect that both red and blue data sets will return similar values. However as shown in the three smaller boxes this is not the case. Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value.
Initially I used linear interpolation to obtain a value; next I tried Catmull-Rom interpolation. The former gives values that come close together and then drift apart between each data point; the latter gives values that remain closer together, but with a greater average error.
Can anyone suggest another strategy or interpolation method which will provide greater smoothing (perhaps by using a greater number of sample points from each data set)?

I believe your question has no straight answer without further knowledge of the underlying sampled process. By its nature, the value of the function between samples can be almost anything, so I think there is no way to guarantee that the interpolations of two sample arrays converge.
That said, if you have prior knowledge of the underlying process, you can choose among several interpolation methods to minimize the errors. For example, if you measure the drag force as a function of the wing velocity, you know the relation is quadratic (a*V^2). You can then fit a 2nd-order polynomial and get a pretty good match between the interpolations of the two series.
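As a rough illustration of the drag example, here is a minimal Python sketch. It assumes the model is exactly y = a*V^2 with no noise, and the coefficient, velocities and function names are all made up for the example. Fitting the same model to two differently spaced sample sets recovers the same curve, so the two "interpolations" agree everywhere.

```python
def fit_quadratic_through_origin(v, y):
    """Least-squares estimate of a in the assumed model y = a * v**2."""
    num = sum(yi * vi ** 2 for vi, yi in zip(v, y))
    den = sum(vi ** 4 for vi in v)
    return num / den

TRUE_A = 0.5  # hypothetical drag coefficient, chosen only for the demo


def drag(v):
    """The (assumed) underlying process."""
    return TRUE_A * v ** 2


# Two staggered, unevenly spaced sample sets of the same process.
red_v = [1.0, 3.0, 5.0, 7.0]
blue_v = [2.0, 4.5, 6.0, 8.0]

a_red = fit_quadratic_through_origin(red_v, [drag(v) for v in red_v])
a_blue = fit_quadratic_through_origin(blue_v, [drag(v) for v in blue_v])
print(a_red, a_blue)
```

With noisy measurements the two estimates would differ slightly, but the difference would no longer depend on where the sample points happen to fall.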

Try B-splines: Catmull-Rom interpolates (goes through the data points), B-spline does smoothing.
For example, for uniformly-spaced data (not your case)
Bspline(t) = (data(t-1) + 4*data(t) + data(t+1)) / 6
Of course the interpolated red / blue curves depend on the spacing of the red / blue data points,
so cannot match perfectly.
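A minimal Python sketch of the uniform-spacing formula above. The function name and the choice to leave the two end points unchanged are assumptions made for the example; non-uniform data such as yours would need a proper B-spline fit instead.

```python
def bspline_smooth(data):
    """Apply (d[t-1] + 4*d[t] + d[t+1]) / 6 to uniformly spaced data.

    The first and last values are copied unchanged (an assumption of this
    sketch, since they have only one neighbour).
    """
    if len(data) < 3:
        return list(data)
    out = [data[0]]
    for t in range(1, len(data) - 1):
        out.append((data[t - 1] + 4 * data[t] + data[t + 1]) / 6)
    out.append(data[-1])
    return out


# A single spike is spread over its neighbours instead of being passed through.
print(bspline_smooth([0, 0, 6, 0, 0]))
```

Note that the smoothed curve no longer passes through the spike, which is exactly the difference from Catmull-Rom.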

I'd like to quote Introduction to Catmull-Rom Splines to suggest not using Catmull-Rom for this interpolation task.
"One of the features of the Catmull-Rom spline is that the specified curve will pass through all of the control points - this is not true of all types of splines."
By definition your red interpolated curve will pass through all red data points and your blue interpolated curve will pass through all blue points. Therefore you won't get a best fit for both data sets.
You might change your boundary conditions and use data points from both data sets for a piecewise approximation as shown in these slides.

I agree with ysap that this question cannot be answered as you may be expecting. There may be better interpolation methods, depending on your model dynamics - as with ysap, I recommend methods that utilize the underlying dynamics, if known.
Regarding the red/blue samples, I think you have made a good observation about sampled and interpolated data sets and I would challenge your original expectation that:
When computing the value at any time, we expect that both red and blue data sets will return similar values.
I do not expect this. If you assume that you cannot perfectly interpolate - and particularly if the interpolation error is large compared to the errors in samples - then you are certain to have a continuous error function that exhibits largest errors longest (time) from your sample points. Therefore two data sets that have differing sample points should exhibit the behaviour you see because points that are far (in time) from red sample points may be near (in time) to blue sample points and vice versa - if staggered as your points are, this is sure to be true. Thus I would expect what you show, that:
Viewed over time the values from each data set (red and blue) will appear to diverge and then converge about the original value.
(If you do not have information about underlying dynamics (except frequency content), then Giacomo's points on sampling are key - however, you need not interpolate if looking at info below Nyquist.)
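The diverge-and-converge argument can be checked with a small Python sketch. Everything here is an assumption made for illustration: the underlying function is t^2, the red and blue sample times are evenly staggered, and both sets use linear interpolation.

```python
def lerp(ts, ys, t):
    """Piecewise-linear interpolation of the samples (ts, ys) at time t."""
    for i in range(len(ts) - 1):
        if ts[i] <= t <= ts[i + 1]:
            w = (t - ts[i]) / (ts[i + 1] - ts[i])
            return ys[i] + w * (ys[i + 1] - ys[i])
    raise ValueError("t outside sampled range")


def f(t):
    """Hypothetical underlying process."""
    return t * t


red_t = [0.0, 2.0, 4.0, 6.0]   # red samples
blue_t = [1.0, 3.0, 5.0]       # blue samples, staggered between the red ones
red_y = [f(t) for t in red_t]
blue_y = [f(t) for t in blue_t]


def errors(t):
    """Return (red error, blue error) of the interpolated value at time t."""
    return (lerp(red_t, red_y, t) - f(t), lerp(blue_t, blue_y, t) - f(t))


for t in (2.0, 2.5, 3.0):
    print(t, errors(t))
```

At a red sample time the red error is zero while the blue error is at its largest, and the reverse holds at a blue sample time; halfway between them the two errors are equal, which is where the two curves appear to converge.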

When sampling the original continuous function, the sampling frequency should comply with the Nyquist-Shannon sampling theorem; otherwise the sampling process introduces an error known as aliasing. This error differs between the two data sets, so you get different values when you interpolate.
Therefore, you need to know the highest frequency B of the original function and then collect samples at a frequency greater than 2B. If your function has very high frequencies and you cannot sample that fast, you should at least try to filter them out before sampling.
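A small Python sketch of aliasing, using made-up numbers: a 9 Hz sine sampled uniformly at 10 Hz (well below the required rate of more than 18 Hz) produces exactly the same samples as a sign-flipped 1 Hz sine, so nothing computed from the samples alone can tell the two apart.

```python
import math


def sample(freq_hz, rate_hz, n):
    """Return n uniform samples of sin(2*pi*freq*t) taken at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n)]


fast = sample(9.0, 10.0, 20)   # undersampled 9 Hz signal
slow = sample(1.0, 10.0, 20)   # properly sampled 1 Hz signal

# The undersampled signal is indistinguishable from an inverted 1 Hz sine.
worst = max(abs(a + b) for a, b in zip(fast, slow))
print(worst)
```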

Related

Spatstat, using the Matérn cluster process to generate homogeneous landscapes, how do I interpret the Ripley K function?

I am looking to develop a point process that ranges from homogeneous, i.e. no correlation between points to a point cluster process that does have correlation between points. From experimentation I can see that using the Matérn cluster process I can generate landscapes that are clustered.
library(spatstat)
plot(rMatClust(kappa=3,r=0.1,mu=50))
I want to use the simplest code that increases the level of homogeneity, i.e. decreases the dependence of points on each other. I do not want a binary model where the pattern is either homogeneous or not, i.e. just a Poisson process, which can be generated with:
plot(rpoispp(150))
From experimentation I noticed that if I increase the radius of the clusters using the Matérn cluster process, I do seem to create a pseudo homogeneous pattern.
plot(rMatClust(kappa=3,r=0.3,mu=50))
plot(rMatClust(kappa=3,r=0.7,mu=50))
Is this a good way of generating degrees of homogeneity? I understand that I can use statistical tests to measure the degree of clustering compared to a completely random Poisson process, such as the Ripley K test. For example, if I assign the Matérn cluster process data to variables, such as:
a<-rMatClust(kappa=3,r=0.1,mu=50)
b<-rMatClust(kappa=3,r=0.3,mu=50)
c<-rMatClust(kappa=3,r=0.7,mu=50)
Then use the Ripley K test and plot the results:
plot(Kest(a))
plot(Kest(b))
plot(Kest(c))
I can see that the difference between a homogeneous Poisson process and the clustered point process decreases. I still do not fully understand the significance of the various K values with respect to edge effects and so forth, or how to interpret the Ripley K function, but I think this is the right direction to be heading in. How do I interpret the Ripley K function? Another problem is the number of points in each plot: it is not consistent, as can be seen by:
summary(a)
summary(b)
summary(c)
Any knowledgeable feedback on this is greatly appreciated.

The standard terminology is that you want to generate a clustered point pattern.
The function rMatClust generates a clustered point pattern at random, in a two-stage process. The first stage is to generate "parent" points completely at random. The second stage is to generate, for each "parent", a random number of "offspring" points, and to place the "offspring" points inside a circle of radius R around their "parent". The final result is the collection of all "offspring" points. From this description (and help(rMatClust)) you can figure out what happens for different parameter values.
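The two-stage description can be sketched in plain Python. This is not spatstat's implementation: the function names, the unit-square window, the Knuth-style Poisson sampler and the way edges are handled (parents drawn in a window enlarged by the radius, offspring outside the unit square discarded) are all assumptions made for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) count using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def matern_cluster(kappa, radius, mu, seed=0):
    """Sketch of a Matern cluster pattern on the unit square."""
    rng = random.Random(seed)
    lo, hi = -radius, 1.0 + radius            # enlarged window for parents
    n_parents = poisson(kappa * (hi - lo) ** 2, rng)
    points = []
    for _ in range(n_parents):                # stage 1: random parents
        px, py = rng.uniform(lo, hi), rng.uniform(lo, hi)
        for _ in range(poisson(mu, rng)):     # stage 2: offspring per parent
            d = radius * math.sqrt(rng.random())   # uniform inside the disc
            a = rng.uniform(0.0, 2.0 * math.pi)
            x, y = px + d * math.cos(a), py + d * math.sin(a)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                points.append((x, y))
    return points


for r in (0.1, 0.3, 0.7):
    print(r, len(matern_cluster(kappa=3, radius=r, mu=50, seed=1)))
```

Because both the number of parents and the number of offspring per parent are random, the total number of points varies from run to run, which is why the three patterns in the question have different sizes.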
The K function (not the "K test") is a summary of the spacing between points in a point pattern. At a distance r, the value of K(r) is the normalised average number of points observed to fall within distance r of a typical point in the pattern. It is normalised so that it does not depend on the number of points, making it possible to compare patterns with different numbers of points.
When you plot the K function, one of the curves is the theoretical curve that would be expected if the points are completely random, and the other curves are computed from the data point pattern. This allows you to assess whether the point pattern appears to be clustered.
I strongly suggest you do some reading in Chapter 7 of the spatstat book. You can download this chapter for free.
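To make the definition of K(r) concrete, here is a naive Python sketch. It deliberately ignores edge correction, which spatstat's Kest does apply, so it is only meant to show what is being counted; the function name and the unit-area default are assumptions.

```python
import math


def naive_k(points, r, area=1.0):
    """Uncorrected K estimate: area * (ordered pairs within r) / (n * (n - 1))."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    close = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= r:
                close += 1
    return area * close / (n * (n - 1))


pattern = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]   # tiny hand-made pattern
r = 0.2
print(naive_k(pattern, r), math.pi * r ** 2)
```

For a completely random pattern the theoretical value is pi * r^2, so an estimate well above that at small r suggests clustering and one well below it suggests regularity.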

How would I fit a polynomial to a given set of points, that falls inside a given range? (ideally done manually without computer)

Let's say I've fitted a polynomial to a set of 7 data points using gaussian elimination to bring a matrix to row echelon and then reduced row echelon form. I've done this all by hand, and when graphed, the polynomial goes through each point. Success! BUT, the polynomial goes too far up in between a couple of these points. Ideally, the polynomial doesn't go above the highest data point, or below the lowest data point. I don't care what it does outside of the domain of the data points. Right now it goes far above the highest data point, so it is effectively useless for my case.
Is there any way I can redo these calculations, but in a way that ensures the polynomial falls within a given range (inside the domain of the data points)? After calculating the polynomial, I can restrict domain so that it doesn't extrapolate outside of the given data, but I CAN'T restrict range because it will make the function discontinuous.
Ideally I can do all of this by hand, without a computer, but I'm open to other options.
Thanks!

Detect peaks at beginning and end of x-axis

I've been working on detecting peaks within a data set of thousands of y~x relationships. Thanks to this post, I've been using loess and rollapply to detect peaks by comparing the local maximum to the smooth. Since then, I've been working to optimise the span and w thresholds for the loess and rollapply functions, respectively.
However, I have realised that several of my relationships have a peak at the beginning or the end of the x-axis, and these peaks are of interest to me, but they are not being identified. For now, I've tried adding fake values outside the range of my x variable to imitate a peak. For example, if my x values range from -50 to 160, I create x values of -100 and 210 and assign a y value of 0 to them.
This helped me to identify some of the relationships that have a peak at the beginning or the end. As you can see here:
However, for some it does not work.
Apart from the fact that I feel uncomfortable adding 'fake' values to the relationship, the smoothing frequently shifts the location of the peak and, more importantly, I cannot find a solution that detects these beginning or end peaks. Does anyone know how to work out a solution that works in R?

Plotting cosine wave samples in Maple

I'm having trouble with Maple.
I have a cosine wave, which I figured out how to plot, but now I have to take samples
from that wave and plot those(as dots) over top of the original cosine wave.
Here is the question from the assignment:
"Produce the samples from Q1 above and plot the result (plot the points on a plot of the cosine wave - use different colours for both, it will look like a cosine wave with dots on it)"
Problem is, my samples keep being straight lines at different heights
http://i197.photobucket.com/albums/aa221/Haseo_Ame/Maple.png
I'm not sure what I'm doing wrong since I've never used maple before.

Firstly, try not to build up lists using repeated concatenation (which can incur an O(n^2) cost in resources) if you can use the seq command instead (which can incur an O(n) cost in resources). You should always reconsider when coding like s:=[op(s),...] in a loop.
Next, a point-plot needs pairs of x-y values. Your list is just a collection of scalar values, and hence is being interpreted as a collection of constant functions to be plotted.
The pairs of x-y values can be in a list of (2-element) lists such as [[x1,y1],...,[xn,yn]].
It's not clear how you want your x-axis scaled, but you could start off with something like this,
s:=[seq([i, 4*cos(2*Pi*i*70/200+Pi/4)],i=0..20)]:
plot(s, style=point);
# s:=[seq([2*Pi*i*70/200+Pi/4, 4*cos(2*Pi*i*70/200+Pi/4)],i=0..20)]:
ps. Please post source code as text, not as embedded images, so that anyone trying to help needn't type it all in.

How to detect a trend inside unsteady data (e.g. Trendly)?

I was wondering what kind of model / method / technique Trendly might use to achieve this model:
[It tries to find the moments where significant changes set in and ignores random movements]
Any pointers very welcome! :)

I've never seen 'Trendly', and don't know anything about it, but if I wanted to produce that red line from that blue line, in an algorithmic fashion, I would try:
1. Fourier transform the whole data set.
2. Choose a block size longer than the period of the dominant frequency.
3. Divide the data up into blocks of the chosen size.
4. Compare adjacent blocks with a statistical test of some sort.
5. Where the test says two blocks belong to the same underlying distribution, merge them.
6. If any were merged, go back to 4.
7. The red trend line is the mean of each block.
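The block-merging procedure above can be sketched in Python. Several things here are assumptions, not part of the answer: the Fourier step is skipped and the block size is passed in directly, the "statistical test of some sort" is a Welch-style t statistic with an arbitrary threshold, and all names are made up.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch-style t statistic for the difference of two block means."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    diff = abs(mean(a) - mean(b))
    if va + vb == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / (va + vb) ** 0.5


def block_trend(data, block_size, t_threshold=3.0):
    """Return a list of (block length, block mean) after merging similar blocks."""
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    blocks = [list(data[i:i + block_size]) for i in range(0, len(data), block_size)]
    if len(blocks) > 1 and len(blocks[-1]) < 2:
        blocks[-2:] = [blocks[-2] + blocks[-1]]   # fold a 1-sample tail back in
    merged = True
    while merged:                                  # repeat until nothing merges
        merged = False
        i = 0
        while i < len(blocks) - 1:
            if welch_t(blocks[i], blocks[i + 1]) < t_threshold:
                blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]
                merged = True
            else:
                i += 1
    return [(len(b), mean(b)) for b in blocks]


noise = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1]
series = noise + [10 + x for x in noise]           # one clear level shift
trend = block_trend(series, block_size=4)
print(trend)
```

The threshold controls how large a change must be before it counts as a new level; a real implementation would probably use a proper significance test rather than a fixed cut-off.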
A simple "median" function could produce smoother curves over a mostly un-smooth curve.
Otherwise, a brute-force or genetic algorithm could be used; attempting to find the way to split the data into sections, so that more sections = worse solution, and less accuracy of the lines = worse solution.
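The median suggestion above could look like the following Python sketch. The centred window, its default size and the truncation of the window at both ends of the data are assumptions made for the example.

```python
from statistics import median


def rolling_median(data, window=5):
    """Centred rolling median; the window is truncated at the ends of the data."""
    half = window // 2
    return [median(data[max(0, i - half):i + half + 1]) for i in range(len(data))]


# An isolated spike is removed entirely, unlike with a rolling mean.
print(rolling_median([1, 1, 9, 1, 1], window=3))
```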
Another way would be like this: Start at the beginning. As soon as the line moves outside of some radius (3 above or 3 below the first, for instance) set the new height to an average of the current line's height and the previous marker.
If you keep doing that, it would ignore small fluctuations. However, if a fluctuation was large enough, it would still affect the line.
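One possible reading of this radius-based approach, as a Python sketch: the marker stays put while the data remains within the radius, and otherwise moves to the average of the current value and the previous marker. The function name and the default radius of 3 (taken from the example above) are assumptions.

```python
def follow_trend(data, radius=3.0):
    """Return a stepped trend line that ignores moves smaller than radius."""
    if not data:
        return []
    marker = data[0]
    trend = []
    for value in data:
        if abs(value - marker) > radius:
            # Large move: shift the marker halfway towards the new value.
            marker = (value + marker) / 2
        trend.append(marker)
    return trend


print(follow_trend([0, 1, -1, 2, 10, 10, 10], radius=3))
```

Because the marker only moves halfway each time, a sustained jump is approached in a few steps rather than followed immediately.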
