I am capturing some points on an X, Y plane to represent some values. I ask the user to select a few points and I want the system to then generate a curve following the trend that the user creates. How do I calculate this? So say it is this:
Y = dollar amount
X = unit count
user input: (2500, 200), (4500, 500), (9500, 1000)
Is there a way I can calculate some sort of curve that follows those points, so that based on that selection I would know what Y = 100 would be on the same scale/trend?
EDIT: People keep asking for the nature of the curve: yes, logarithmic, but I'd also like to check out some other options. It's for pricing, and the constraint is that as X increases, Y should always be higher. However, the rate of change of the curve should depend on the two adjacent points that the user selected; we could probably require a certain number of points. Does that help?
EDIT: Math is hard.
EDIT: Maybe a parabola then?
The problem is that there are multiple curves that you can fit to the same data. To borrow an example from my old stats book, here is the same data set (1, 1, 1, 10, 1, 1, 1) with four curves:
You need to specify the overall trend to get a meaningful result.
First, you need an idea of what your line is, or better said, what type of line fits your data best. Is it linear (a straight line), or does it curve (e.g., x squared)? It sounds like this is a curve.
If your curve is a parabola, then you will need to solve y = Ax^2 + Bx + C using the three points that the user has chosen. You need at least 3 points to solve for 3 unknowns.
200 = A(2500)^2 + B(2500) + C
500 = A(4500)^2 + B(4500) + C
1000 = A(9500)^2 + B(9500) + C
Given these three equations, you should be able to solve for A, B and C, then use these to plot a new curve.
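For illustration, here is a minimal sketch of solving that 3-by-3 system with NumPy (NumPy is an assumption; any linear solver will do):

import numpy as np

# One row [x^2, x, 1] per user-chosen point.
xs = [2500, 4500, 9500]
ys = [200, 500, 1000]
M = np.array([[x**2, x, 1] for x in xs], dtype=float)

A, B, C = np.linalg.solve(M, np.array(ys, dtype=float))
predict = lambda x: A*x**2 + B*x + C
print(predict(100))  # evaluate the fitted parabola at X = 100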
A least-squares fit would give you a curve that matches the data nicely.
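As a sketch (again assuming NumPy), np.polyfit performs a least-squares polynomial fit in one call; with exactly three points and degree 2 it interpolates them, and with more points it finds the best compromise:

import numpy as np

xs = [2500, 4500, 9500]
ys = [200, 500, 1000]

coeffs = np.polyfit(xs, ys, 2)      # degree-2 least-squares fit
print(np.polyval(coeffs, 100))      # evaluate the fit at X = 100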
This is a rather general extrapolation problem. In your case, fitting a quadratic (parabola) is probably the most reasonable course of action. Depending on how well your data fits a quadratic, you may want to fit it to more than 3 points (the noisier and weirder the data, the more points you should use).
Depending on the amount and type of data you have, you may want to try LOESS regression.
However, this may not be a good option if you only have 3 points, as in your example (but keep in mind that you will not be able to extrapolate well from 3 points no matter which algorithm you use).
Another option would be B-splines.
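A minimal sketch with SciPy (an assumption), fitting a quadratic B-spline through the three example points; note that an interpolating spline does not by itself enforce the always-increasing constraint from the question:

from scipy.interpolate import make_interp_spline

xs = [2500, 4500, 9500]
ys = [200, 500, 1000]

# k=2 because a spline of degree k needs at least k+1 points.
spline = make_interp_spline(xs, ys, k=2)
print(spline(5000))  # evaluate the spline at X = 5000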
I am back at college learning maths, and I want to try to use some of this knowledge to create some SVG with d3.js.
If I have a function f(x) = x^3 - 3x^2 + 3x - 1
I would take the following steps:
1. Find the x-intercepts, where y = 0
2. Find the y-intercept, where x = 0
3. Find the stationary points, where dy/dx = 0
I would then have 2 x values from point 3 to plug into the original equation.
I would then draw a nature table to judge the flow of the graph or curve.
Plot the known points from the above and sketch the graph.
Translating what I would do with pen and paper into code is where I could really use some advice, specifically on the following:
How can I programmatically factorise the cubic in point 1 above to find the x-intercepts where y = 0? I honestly do not know where to even start.
How would I programmatically find dy/dx and the values of the stationary points?
If I actually get this far, what should I use in d3 to join the points on the graph?
Your other "steps" have nothing to do with d3 or plotting.
Find the x intercepts for when y = 0
This is root finding. Look for algorithms to help with this.
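For instance, here is a minimal bisection sketch in Python, one of the simplest root-finding algorithms (the bracketing interval [0, 2] is an assumption that happens to contain this cubic's root):

def bisect(f, lo, hi, tol=1e-9):
    # Assumes f(lo) and f(hi) have opposite signs.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Keep the half-interval where the sign change occurs.
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

f = lambda x: x**3 - 3*x**2 + 3*x - 1
print(bisect(f, 0.0, 2.0))  # ~1.0, the only real root, since f(x) = (x - 1)^3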
Find the y intercept when x = 0
Easy: substitute x = 0 into the function to get y = f(0) = -1.
Find the stationary points when dy/dx = 0
Take the first derivative to get 3x^2 - 6x + 3 and repeat the root-finding step. This one is easy via the quadratic formula: it factors as 3(x - 1)^2, so the only stationary point is at x = 1.
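A sketch of doing that programmatically, representing the polynomial by its coefficient list (NumPy is an assumption here, not something d3 provides):

import numpy as np

coeffs = [1, -3, 3, -1]        # x^3 - 3x^2 + 3x - 1
dcoeffs = np.polyder(coeffs)   # derivative coefficients: [3, -6, 3]
print(np.roots(dcoeffs))       # stationary points: [1., 1.] (a repeated root)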
I would then have 2 x values from point 3 to plug into the original equation. I would then draw a nature table to judge the flow of the graph or curve. Plot the known points from the above and sketch the graph.
I would just draw the curve. Pick a range for x and go.
It's great to learn d3. You'll end up with something like this:
https://maurizzzio.github.io/function-plot/
For a cubic polynomial, there are closed formulas available to find all the particular points that you want (https://en.wikipedia.org/wiki/Cubic_function), and it is a sound approach to determine them.
Anyway, you will have to plot the smooth curve, which means that you will need to compute points that are close enough together and draw a polyline that joins them.
Doing this, you are actually performing the first steps of numerical root isolation, with such accuracy that the approximate and exact roots will be practically indistinguishable.
So an easy combined solution is to draw the curve as a polyline and find the intersections with the X axis as well as extrema using this polyline representation, rather than by means of more sophisticated methods.
This approach works for any continuous curve and is very easy to implement. So you actually draw the curve to find the particular points, rather than the converse, as is done with analytical methods.
For best results on complicated curves, you can adapt the point density based on the local curvature, but this is another story.
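For illustration, a minimal sketch of that polyline approach (NumPy assumed; the sampling range and density are arbitrary choices): sample the curve densely, then read roots and extrema off the polyline by looking for sign changes.

import numpy as np

f = lambda x: x**3 - 3*x**2 + 3*x - 1
xs = np.linspace(-1, 3, 1000)
ys = f(xs)

# x-intercepts: between consecutive samples whose y values change sign.
crossings = np.nonzero(np.diff(np.sign(ys)) != 0)[0]
roots = (xs[crossings] + xs[crossings + 1]) / 2
print(roots)  # ~ [1.0]

# Extrema: sign changes of the finite-difference slope (none for this
# cubic, which is monotonically increasing).
extrema = np.nonzero(np.diff(np.sign(np.diff(ys))) != 0)[0]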
The only equation to calculate this that I can find involves t in the range [0, 1], but I have no idea how long it will take to travel the entire path, so I can't calculate (1 - t).
I know the speed at which I'm traveling, but it seems heavy-handed to calculate the total time beforehand (nor do I actually know how to do that calculation). What is an equation to figure out the position without knowing the total time?
Edit: To clarify on the cubic Bezier curve: I have four control points (P0 to P3), and to get a value on the curve for a given t, I use the four points as such:
B(t) = (1-t)^3 P0 + 3t(1-t)^2 P1 + 3t^2(1-t) P2 + t^3 P3
I am not using a parametric equation to define the curve. The control points are what define the curve. What I need is an equation that does not require the use of knowing the range of t.
I think there is a misunderstanding here. The 't' in the cubic Bezier curve's definition does not refer to 'time'. It is the parameter that the x, y, and even z functions are based on. Unlike the traditional way of representing y as a function of x, such as y = f(x), an alternative way of representing a curve is the parametric form, which represents x, y, and z as functions of an additional parameter t: C(t) = (x(t), y(t), z(t)). Typically the t value ranges from 0 to 1, but this is not a must. The common representation of a circle as x = cos(t) and y = sin(t) is an example of parametric representation. So, if you have the parametric representation of a curve, you can evaluate the position on the curve for any given t value. It has nothing to do with the time it takes to travel the entire path.
You have the given curve and you have your speed. To calculate what you're asking for, divide the distance you have traveled (speed times elapsed time) by the total length of the curve; that gives you the parameter t you need. So if the total curve has a length of 72.2 units and your speed is 1 unit per time step, then after one step your t is 1/72.2. (Note this assumes t advances uniformly with arc length, which is only approximately true for a Bezier curve.)
Your only missing piece is calculating the length of a given curve. This is typically done by subdividing it into line segments small enough that you don't care about the error, and adding up the total length of those segments. You could combine the two steps if you were so inclined: given your speed and the elapsed time, you know the distance you need to travel, so step through the curve in small increments (say 1/1000th of t at a time), add the length of the line segment between consecutive samples, subtract it from the distance you still need to travel, and keep that up until you've gone as far as you need to go.
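A minimal sketch of that walk in Python (plain tuples for 2D points; the function names are made up):

import math

def bezier(t, p0, p1, p2, p3):
    # Evaluate a cubic Bezier at parameter t; points are (x, y) tuples.
    u = 1 - t
    return tuple(u**3*a + 3*u**2*t*b + 3*u*t**2*c + t**3*d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def point_after_distance(dist, p0, p1, p2, p3, steps=1000):
    # Walk the curve in small parameter steps, spending `dist` along the way.
    prev = bezier(0.0, p0, p1, p2, p3)
    for i in range(1, steps + 1):
        cur = bezier(i / steps, p0, p1, p2, p3)
        seg = math.hypot(cur[0] - prev[0], cur[1] - prev[1])
        if dist <= seg:
            w = dist / seg if seg else 0.0  # interpolate inside this segment
            return (prev[0] + w*(cur[0] - prev[0]),
                    prev[1] + w*(cur[1] - prev[1]))
        dist -= seg
        prev = cur
    return prev  # distance exceeds the curve's length; clamp to the end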
The range for t is between 0 and 1.
x = (1-t)*(1-t)*(1-t)*p0x + 3*(1-t)*(1-t)*t*p1x + 3*(1-t)*t*t*p2x + t*t*t*p3x;
y = (1-t)*(1-t)*(1-t)*p0y + 3*(1-t)*(1-t)*t*p1y + 3*(1-t)*t*t*p2y + t*t*t*p3y;
I have two sequences of length n and m. Each is a sequence of points of the form (x, y) representing curves in an image. I need to find how different (or similar) these sequences are, given that:
one sequence is likely longer than the other (i.e., one can be half or a quarter as long as the other, but if they trace approximately the same curve, they are the same)
these sequences could be in opposite directions (i.e., sequence 1 goes from left to right, while sequence 2 goes from right to left)
I looked into some difference estimates like Levenshtein as well as edit-distances in structural similarity matching for protein folding, but none of them seem to do the trick. I could write my own brute-force method but I want to know if there is a better way.
Thanks.
Do you mean that you are trying to match curves that have been translated in x,y coordinates? One technique from image processing is to use chain codes [I'm looking for a decent reference, but all I can find right now is this] to encode each sequence and then compare those chain codes. You could take the sum of the differences (modulo 8) and if the result is 0, the curves are identical. Since the sequences are of different lengths and don't necessarily start at the same relative location, you would have to shift one sequence and do this again and again, but you only have to create the chain codes once. The only way to detect if one of the sequences is reversed is to try both the forward and reverse of one of the sequences. If the curves aren't exactly alike, the sum will be greater than zero but it is not straightforward to tell how different the curves are simply from the sum.
This method will not be rotationally invariant. If you need a method that is rotationally invariant, you should look at Boundary-Centered Polar Encoding. I can't find a free reference for that, but if you need me to describe it, let me know.
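A minimal sketch of 8-direction chain coding in Python (assuming the points are dense enough that consecutive steps are meaningful):

import math

def chain_code(path):
    # Quantize each step between consecutive (x, y) points into one of
    # 8 compass directions, numbered 0..7 counterclockwise from east.
    codes = []
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)   # -pi..pi
        codes.append(round(angle / (math.pi / 4)) % 8)
    return codes

# Two translated copies of the same curve get identical codes.
a = [(0, 0), (1, 1), (2, 1), (3, 0)]
b = [(5, 5), (6, 6), (7, 6), (8, 5)]
print(chain_code(a) == chain_code(b))  # True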
A method along these lines might work:
For both sequences:
Fit a curve through the sequence. Make sure that you have a continuous one-to-one function from [0,1] to points on this curve. That is, for each (real) number between 0 and 1, this function returns a point on the curve belonging to it. By tracing the function for all numbers from 0 to 1, you get the entire curve.
One way to fit a curve would be to draw a straight line between each pair of consecutive points (it is not a nice curve, because it has sharp bends, but it might be fine for your purpose). In that case, the function can be obtained by calculating the total length of all the line segments (Pythagoras). The point on the curve corresponding to a number Y (between 0 and 1) corresponds to the point on the curve that has a distance Y * (total length of all line segments) from the first point on the sequence, measured by traveling over the line segments (!!).
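As an illustration, a minimal Python sketch of building such a function for the line-segments case (the helper name is made up):

import math

def make_curve_function(points):
    # Build F: [0, 1] -> (x, y) on the polyline through `points`,
    # parameterized by fraction of total arc length.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]

    def F(d):
        target = d * total
        i = 0
        # Find the segment that contains the target arc-length position.
        while i < len(points) - 2 and cumulative[i + 1] < target:
            i += 1
        seg = cumulative[i + 1] - cumulative[i]
        w = (target - cumulative[i]) / seg if seg else 0.0
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        return ((1 - w)*x0 + w*x1, (1 - w)*y0 + w*y1)

    return F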
Now, after we have obtained such a function F(double) for the first sequence, and G(double) for the second sequence, we can calculate the similarity as follows:
double epsilon = 0.01;
double curveDistanceSquared = 0.0;
for (double d = 0.0; d < 1.0; d = d + epsilon)
{
    Point pointOnCurve1 = F(d);
    Point pointOnCurve2 = G(d);
    // alternatively, use G(1.0 - d) to check whether the second sequence is reversed
    double distanceOfPoints = pointOnCurve1.EuclideanDistance(pointOnCurve2);
    curveDistanceSquared = curveDistanceSquared + distanceOfPoints * distanceOfPoints;
}
double similarity = 1.0 / curveDistanceSquared;
Possible improvements:
-Find an improved way to fit the curves. Note that you still need the function that traces the curve for the above method to work.
-When calculating the distance, consider reparametrizing the function G in such a way that the distance is minimized. This means choosing an increasing function R such that R(0) = 0 and R(1) = 1, but which is otherwise general. When calculating the distance you use
Point pointOnCurve1 = F(d);
Point pointOnCurve2 = G(R(d));
and then try to choose R in such a way that the distance is minimized (to see why this is allowed, note that G(R(d)) also traces the curve).
Why not do some sort of curve-fitting procedure (least squares, whether ordinary or non-linear) and see if the coefficients on the shape parameters are the same? If you run it as a panel-data sort of model, there are explicit statistical tests of whether sets of parameters are significantly different from one another. That would solve the problem of the same curve being sampled at different resolutions.
Step 1: Canonicalize the orientation. For example, let's say that all curves start at the endpoint with the lowest lexicographic order.
def inCanonicalOrientation(path):
    return path if path[0] < path[-1] else path[::-1]
Step 2: You can either be roughly accurate, or very accurate. If you wish to be very accurate, calculate a spline, or fit both curves to a polynomial of appropriate degree, and compare coefficients. If you'd like just a rough estimate, do as follows:
import math

def resample(path, numPoints):
    # Resample a polyline (a list of (x, y) points) into numPoints
    # points evenly spaced along its arc length.
    cumulative = [0.0]  # cumulative arc length at each vertex
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    totalLength = cumulative[-1]
    seg = 0
    for i in range(numPoints):
        samplePosition = i / (numPoints - 1) * totalLength
        # Advance to the segment containing this arc-length position.
        while seg < len(path) - 2 and cumulative[seg + 1] < samplePosition:
            seg += 1
        segLength = cumulative[seg + 1] - cumulative[seg]
        howFar = (samplePosition - cumulative[seg]) / segLength if segLength else 0.0
        (x0, y0), (x1, y1) = path[seg], path[seg + 1]
        yield ((1 - howFar) * x0 + howFar * x1,
               (1 - howFar) * y0 + howFar * y1)
This can be modified from a linear resampling to something better.
def error(pathA, pathB):
    pathA = inCanonicalOrientation(pathA)
    pathB = inCanonicalOrientation(pathB)
    # Resample both paths at the same (higher) resolution.
    higherResolution = max(len(pathA), len(pathB))
    resampledA = list(resample(pathA, higherResolution))
    resampledB = list(resample(pathB, higherResolution))
    # Sum the Euclidean distances between corresponding sample points.
    totalError = sum(
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(resampledA, resampledB)
    )
    averageError = totalError / higherResolution
    normalizedError = averageError / Z(pathA, pathB)
    return normalizedError
Where Z is something like the "diameter" of your path, perhaps the maximum Euclidean distance between any two points in a path.
I would use a curve-fitting procedure, but also throw in a constant term, i.e. 0 = B0 + B1*X + B2*Y + B3*X*Y + B4*X^2, etc. This would catch the translational variance, and then you can do a statistical comparison of the estimated coefficients of the curves formed by the two sets of points as a way of classifying them. I'm assuming you'll have to do bivariate interpolation if the data form arbitrary curves in the x-y plane.
I have the following matrix [500, 2]: 500 rows and 2 columns. The left column gives the index of the X observations, and the right column gives the probability with which this X occurs, i.e. a typical probability density relationship.
So, my question is: how do I plot the histogram the right way, so that the x-axis is the x-index and the y-axis is the density (0.01-1.00)? The bandwidth of the estimator is 0.33.
Thanks in advance!
The end of the whole data set looks like this, just for a little orientation:
[490,] 2.338260830 0.04858685
[491,] 2.347839477 0.04797310
[492,] 2.357418125 0.04736149
[493,] 2.366996772 0.04675206
[494,] 2.376575419 0.04614482
[495,] 2.386154067 0.04553980
[496,] 2.395732714 0.04493702
[497,] 2.405311361 0.04433653
[498,] 2.414890008 0.04373835
[499,] 2.424468656 0.04314252
[500,] 2.434047303 0.04254907
@everyone,
Yes, I have made the estimation before, so the bandwidth is what I mentioned. The data is ordered from low to high values, so respectively the probability at the beginning is 0.22, at the peak about 0.48, and at the end 0.15.
The line with the density plots like a charm, but what I have to do in addition is plot a histogram! So, how can I do this, ordering the blocks properly (how should the data be split into bins, etc.)?
Any suggestions?
Here is a part of the data AFTER the estimation; all values are discrete, so I assume a histogram can be created, hopefully.
[491,] 4.956164 0.2618131
[492,] 4.963014 0.2608723
[493,] 4.969863 0.2599309
[494,] 4.976712 0.2589889
[495,] 4.983562 0.2580464
[496,] 4.990411 0.2571034
[497,] 4.997260 0.2561599
[498,] 5.004110 0.2552159
[499,] 5.010959 0.2542716
[500,] 5.017808 0.2533268
[501,] 5.024658 0.2523817
Best regards,
appreciate the fast responses!(bow)
Would it do the job to create a histogram just for the indexes, grouping them 25/50 at a time, for instance, and computing the average probability for each box of 25/50/100/150/200/250?
Assuming the rows are in order from lowest to highest value of x, as they appear to be, you can use the default plot command, the only change you need is the type:
plot(your.data, type = 'l')
EDIT:
Ok, I'm not sure this is better than the density plot, but it can be done:
x = dnorm(seq(-1, 1, length = 500))  # stand-in for your 500 density values
x.bins = rep(1:50, each = 10)        # group the 500 rows into 50 bins of 10
bars = aggregate(x, by = list(x.bins), FUN = sum)[, 2]  # sum within each bin
barplot(bars)
In your case, replace x with the probabilities from the second column of your matrix.
EDIT2:
On second thought, this only makes sense if your 500 rows represent discrete events. If they are instead points along a continuous distribution function adding them together as I have done is incorrect. Mathematically I don't think you can produce the binned probability for a range using only a few points from within that range.
Assuming M is the matrix, wouldn't this just be:
plot(x = M[, 1], y = M[, 2])
You have already done the density estimation since this is not the original data.
I am trying to plot large amounts of points using some library. The points are ordered by time and their values can be considered unpredictable.
My problem at the moment is that the sheer number of points makes the library take too long to render. Many of the points are redundant (that is, they lie on the same line as defined by a function y = ax + b). Is there a way to detect and remove redundant points in order to speed up rendering?
Thank you for your time.
The following is a variation on the Ramer-Douglas-Peucker algorithm for 1.5d graphs:
Compute the line equation between first and last point
Check all other points to find what is the most distant from the line
If the worst point is below the tolerance you want then output a single segment
Otherwise call recursively considering two sub-arrays, using the worst point as splitter
In Python this could be:
def simplify(pts, eps):
    if len(pts) < 3:
        return pts
    x0, y0 = pts[0]
    x1, y1 = pts[-1]
    # Line through the first and last point: y = m*x + q.
    m = (y1 - y0) / float(x1 - x0)
    q = y0 - m * x0
    # Find the point farthest (vertically) from that line.
    worst_err = -1
    worst_index = -1
    for i in range(1, len(pts) - 1):
        x, y = pts[i]
        err = abs(m * x + q - y)
        if err > worst_err:
            worst_err = err
            worst_index = i
    if worst_err < eps:
        # Everything is within tolerance: keep just the endpoints.
        return [(x0, y0), (x1, y1)]
    else:
        # Split at the worst point and simplify each half recursively.
        first = simplify(pts[:worst_index + 1], eps)
        second = simplify(pts[worst_index:], eps)
        return first + second[1:]

print(simplify([(0, 0), (10, 10), (20, 20), (30, 30), (50, 0)], 0.1))
The output is [(0, 0), (30, 30), (50, 0)].
Some notes about Python syntax for lists that may be non-obvious:
x[a:b] is the part of array from index a up to index b (excluded)
x[n:] is the array made using elements of x from index n to the end
x[:n] is the array made using first n elements of x
a+b when a and b are arrays means concatenation
x[-1] is the last element of an array
An example of the results of running this implementation on a graph with 100,000 points with increasing values of eps can be seen here.
I came across this question after I had this very idea: skip redundant points on plots. I believe I came up with a far simpler solution, and I'm happy to share it as my first proposed solution on SO. I've coded it and it works well for me. It also takes the screen scale into account: there may be 100 points' worth of value between two plotted points, but if the user has sized the chart small, they won't see them.
So, while iterating through your data/plot loop, before you draw/add the next data point, peek one value ahead and calculate the change in screen scale (or in value, but I think screen scale is better, for the reason just mentioned). Then do the same for the point after that (getting these values is just a matter of peeking ahead in your array/collection/list). If the two changes are the same (or differ only very slightly, per your own preference), you can skip the middle point by simply adding 'continue' in the loop, since that point lies exactly on the slope between the points before and after it.
Using this method, I reduced a chart from 963 points to 427, for example, with absolutely zero visual change.
You might need to read this a couple of times to take it in, but it's far simpler than the other top solution mentioned here, much lighter weight, and has zero visual effect on your plot.
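A minimal sketch of that idea in Python (the collinearity tolerance is an assumption; a real implementation would measure the deviation in screen pixels after scaling):

def thin(points, tol=1e-9):
    # Drop interior points that lie (within tol) on the straight line
    # between the previous kept point and the next point.
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        x0, y0 = kept[-1]
        x1, y1 = points[i]
        x2, y2 = points[i + 1]
        # Twice the signed triangle area; zero means the three are collinear.
        area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
        if abs(area2) > tol:
            kept.append(points[i])
    kept.append(points[-1])
    return kept

print(thin([(0, 0), (1, 1), (2, 2), (3, 3), (4, 0)]))
# -> [(0, 0), (3, 3), (4, 0)]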
I would probably apply a "least squares" algorithm to obtain a line of best fit. You can then go through your points and downfilter consecutive points that lie close to the line. You only need to plot the outliers, and the points that take the curve back to the line of best fit.
Edit: You may not need to employ "least squares"; if your input is expected to hover around "y=ax+b" as you say, then that's already your line of best fit and you can just use that. :)