Can anyone figure out a function that can perform a mapping from a finite set of N numbers X = {x0, x1, x2, ..., xN} where each x can be valued 0 to 999999999 and N < 999999999, to a set Y = {0, 1, 2, 3, ..., N}.
In my case, i have about 24000000 element in the first set whose values can range as X. This elements have continuous block (for example 53000 to 1234500, then 8000000 to 9000000 and so on) and i have to remap this elements from 0 to 2400000. I don't require to maintain order.
I need a (possibly simple and rapid) math function, or a bitwise transformation, not something like put it ordered into an array and then binary search for their position.
Really thank to whom that can figure out a way to solve this!
Luca
If you don't want to keep some gigabytes of straight map, then augmented segment tree is reasonable approach. Tree should contain intervals and shift of every interval (sum of left intervals). Of course, finding appropriate interval (and shift) in this method is close to the binary search.
For example, you get X=80000015. Find interval for this value - it is 8000000 to 9000000. Rank of this interval is 175501 (1234500-53000 + 1). So X maps to
X => 175501 + 80000015 - 80000000 = 175516
For sparse elements make counting stage - find what is rank R for every number M and put (key=M, value=R) pair in hash table.
X = (3, 19, 20, 101)
table: [(3:0), (19:1), (20:2), (101:3)]
Note that one should keep balance between speed and space - for long filled intervals it is better to store only interval ends.
I am trying to replicate the following figure in R: (adapted from http://link.springer.com/article/10.1007/PL00011669)
The basic concept of the figure is to show the first few components of a DFT, plotted in the time domain, and then show a reconstructed wave in the time domain using only these components (X') relative to the original data (X). I would like to slightly modify the above figure such that all of the lines shown are overlaid on a single plot.
I have been trying to adapt the figure with some real data sampled at 60 Hz. For example:
## 3 second sample where: time is in seconds and var is the variable of interest
temp = data.frame(time=seq(from=0,to=3,by=1/60),
var = c(0.054,0.054,0.054,0.072,0.072,0.072,0.072,0.09,0.09,0.108,0.126,0.126,
0.126,0.126,0.126,0.144,0.144,0.144,0.144,0.144,0.162,0.162,0.144,0.126,
0.126,0.108,0.144,0.162,0.18,0.162,0.126,0.126,0.108,0.108,0.126,0.144,
0.162,0.144,0.144,0.144,0.144,0.162,0.162,0.126,0.108,0.09,0.09,0.072,
0.054,0.054,0.054,0.036,0.036,0.018,0.018,0.018,0.018,0,0.018,0,
0,0,-0.018,0,0,0,-0.018,0,-0.018,-0.018,0,-0.018,
-0.018,-0.018,-0.018,-0.036,-0.036,-0.054,-0.054,-0.072,-0.072,-0.072,-0.072,-0.072,
-0.09,-0.09,-0.108,-0.126,-0.126,-0.126,-0.144,-0.144,-0.144,-0.162,-0.162,-0.18,
-0.162,-0.162,-0.162,-0.162,-0.144,-0.144,-0.144,-0.126,-0.126,-0.108,-0.108,-0.09,
-0.072,-0.054,-0.036,-0.018,0,0,0,0,0.018,0.018,0.036,0.054,
0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.072,0.054,0.072,
0.072,0.072,0.072,0.072,0.072,0.054,0.054,0.054,0.036,0.036,0.036,0.036,
0.036,0.054,0.054,0.072,0.09,0.072,0.036,0.036,0.018,0.018,0.018,0.018,
0.036,0.036,0.036,0.036,0.018,0,-0.018,-0.018,-0.018,-0.018,-0.018,0,
-0.018,-0.036,-0.036,-0.018,-0.018,-0.018,-0.036,0,0,-0.018,-0.018,-0.018,-0.018))
##plot the original data
ggplot(temp, aes(x=time, y=var))+geom_line()
I believe that I can use fft() to eventually accomplish this goal however the leap from the output of fft() to my goal is a bit unclear.
I realize that this question is somewhat similar to: How do I calculate amplitude and phase angle of fft() output from real-valued input? but I am more specifically interested in the actual code for the specific data above.
Please note that I am relatively new to time series analysis so any clarity you could provide w.r.t. putting the output of fft() in context, or any package you could recommend that would accomplish this task efficiently would be appreciated.
Thank you
Matlab is your best tool, and the specific function is just fft(). To use it, first determine several basic parameters of your time domain data:
1, time duration (T), which equals to 3s.
2, Sampling interval T_s, which equals to 1/60 s.
3, Frequency domain revolution f_s, which equals to the frequency difference between two adjacent Fourier basis. You may define f_s according to your needs. However, the smallest possible f_s equals to 1/T=0.333 Hz. As a result, if you want better frequency domain revolution (smaller f_s), you need longer time domain data.
4, Maximum frequency f_M, which equals to 1/(2T_s)=30 according to Shannon sampling theory.
5, DFT length N, which equals to 2*f_M/f_s.
Then find out the specific frequencies of four Fourier basis that you want to use to approximate the data. For example, 3,6,9 and 12 Hz. So f_s = 3 Hz. Then N=2*f_M/f_s=20.
Your Matlab code looks like this:
var=[0.054,0.054,0.054 ...]; % input all your data points here
f_full=fft(var,20); % Do 20-point fft
f_useful=f_full(2:5); % You are interested with the lowest four frequencies except DC
Here f_useful contains the four complex coefficients of four Fourier basis. To reconstruct var, do the following:
% Generate basis functions
dt=0:1/60:3;
df=[3:3:12];
basis1=exp(1j*2*pi*df(1)*dt);
basis2=exp(1j*2*pi*df(2)*dt);
basis3=exp(1j*2*pi*df(3)*dt);
basis4=exp(1j*2*pi*df(4)*dt);
% Reconstruct var
var_recon=basis1*f_useful(1)+...
basis2*f_useful(2)+...
basis3*f_useful(3)+...
basis4*f_useful(4);
var_recon=real(var_recon);
% Plot both curves
figure;
plot(var);
hold on;
plot(var_recon);
Adapt this code to your paper :)
Adapting my own post from Signal Processing. I think it's still relevant for those in Python.
I am no expert in this topic, but have some useful examples to share.
The more Fourier components you keep, the closer you'll mimic the original signal.
This example shows what happens when you keep 10, 20, ...up to n components. Assuming x and y are your data vectors.
import numpy
from matplotlib import pyplot as plt
n = len(y)
COMPONENTS = [10, 20, n]
for c in COMPONENTS:
colors = numpy.linspace(start=100, stop=255, num=c)
for i in range(c):
Y = numpy.fft.fft(y)
numpy.put(Y, range(i+1, n), 0.0)
ifft = numpy.fft.ifft(Y)
plt.plot(x, ifft, color=plt.cm.Reds(int(colors[i])), alpha=.70)
plt.title("First {c} fourier components".format(c=c))
plt.plot(x,y, label="Original dataset", linewidth=2.0)
plt.grid(linestyle='dashed')
plt.legend()
plt.show()
For the book's dataset, keeping up to 4, 10, and n components:
For your dataset, keeping up to 4, 10, and n components:
I have two sequences of length n and m. Each is a sequence of points of the form (x,y) and represent curves in an image. I need to find how different (or similar) these sequences are given that fact that
one sequence is likely longer than the other (i.e., one can be half or a quarter as long as the other, but if they trace approximately the same curve, they are the same)
these sequences could be in opposite directions (i.e., sequence 1 goes from left to right, while sequence 2 goes from right to left)
I looked into some difference estimates like Levenshtein as well as edit-distances in structural similarity matching for protein folding, but none of them seem to do the trick. I could write my own brute-force method but I want to know if there is a better way.
Thanks.
Do you mean that you are trying to match curves that have been translated in x,y coordinates? One technique from image processing is to use chain codes [I'm looking for a decent reference, but all I can find right now is this] to encode each sequence and then compare those chain codes. You could take the sum of the differences (modulo 8) and if the result is 0, the curves are identical. Since the sequences are of different lengths and don't necessarily start at the same relative location, you would have to shift one sequence and do this again and again, but you only have to create the chain codes once. The only way to detect if one of the sequences is reversed is to try both the forward and reverse of one of the sequences. If the curves aren't exactly alike, the sum will be greater than zero but it is not straightforward to tell how different the curves are simply from the sum.
This method will not be rotationally invariant. If you need a method that is rotationally invariant, you should look at Boundary-Centered Polar Encoding. I can't find a free reference for that, but if you need me to describe it, let me know.
A method along these lines might work:
For both sequences:
Fit a curve through the sequence. Make sure that you have a continuous one-to-one function from [0,1] to points on this curve. That is, for each (real) number between 0 and 1, this function returns a point on the curve belonging to it. By tracing the function for all numbers from 0 to 1, you get the entire curve.
One way to fit a curve would be to draw a straight line between each pair of consecutive points (it is not a nice curve, because it has sharp bends, but it might be fine for your purpose). In that case, the function can be obtained by calculating the total length of all the line segments (Pythagoras). The point on the curve corresponding to a number Y (between 0 and 1) corresponds to the point on the curve that has a distance Y * (total length of all line segments) from the first point on the sequence, measured by traveling over the line segments (!!).
Now, after we have obtained such a function F(double) for the first sequence, and G(double) for the second sequence, we can calculate the similarity as follows:
double epsilon = 0.01;
double curveDistanceSquared = 0.0;
for(double d=0.0;d<1.0;d=d+epsilon)
{
Point pointOnCurve1 = F(d);
Point pointOnCurve2 = G(d);
//alternatively, use G(1.0-d) to check whether the second sequence is reversed
double distanceOfPoints = pointOnCurve1.EuclideanDistance(pointOnCurve2);
curveDistanceSquared = curveDistanceSquared + distanceOfPoints * distanceOfPoints;
}
similarity = 1.0/ curveDistanceSquared;
Possible improvements:
-Find an improved way to fit the curves. Note that you still need the function that traces the curve for the above method to work.
-When calculating the distance, consider reparametrizing the function G in such a way that the distance is minimized. (This means you have an increasing function R, such that R(0) = 0 and R(1)=1,
but which is otherwise general. When calculating the distance you use
Point pointOnCurve1 = F(d);
Point pointOnCurve2 = G(R(d));
Subsequently, you try to choose R in such a way that the distance is minimized. (to see what happens, note that G(R(d)) also traces the curve)).
Why not do some sort of curve fitting procedure (least-squares whether it be ordinary or non-linear) and see if the coefficients on the shape parameters are the same. If you run it as a panel-data sort of model, there are explicit statistical tests whether sets of parameters are significantly different from one another. That would solve the problem of the the same curve but sampled at different resolutions.
Step 1: Canonicalize the orientation. For example, let's say that all curved start at the endpoint with lowest lexicographic order.
def inCanonicalOrientation(path):
return path if path[0]<path[-1] else reversed(path)
Step 2: You can either be roughly accurate, or very accurate. If you wish to be very accurate, calculate a spline, or fit both curves to a polynomial of appropriate degree, and compare coefficients. If you'd like just a rough estimate, do as follows:
def resample(path, numPoints)
pathLength = pathLength(path) #write this function
segments = generateSegments(path)
currentSegment = next(segments)
segmentsSoFar = [currentSegment]
for i in range(numPoints):
samplePosition = i/(numPoints-1)*pathLength
while samplePosition > pathLength(segmentsSoFar)+currentSegment.length:
currentSegment = next(segments)
segmentsSoFar.insert(currentSegment)
difference = samplePosition - pathLength(segmentsSoFar)
howFar = difference/currentSegment.length
yield Point((1-howFar)*currentSegment.start + (howFar)*currentSegment.end)
This can be modified from a linear resampling to something better.
def error(pathA, pathB):
pathA = inCanonicalOrientation(pathA)
pathB = inCanonicalOrientation(pathB)
higherResolution = max([len(pathA), len(pathB)])
resampledA = resample(pathA, higherResolution)
resampledB = resample(pathA, higherResolution)
error = sum(
abs(pointInA-pointInB)
for pointInA,pointInB in zip(pathA,pathB)
)
averageError = error / len(pathAorB)
normalizedError = error / Z(AorB)
return normalizedError
Where Z is something like the "diameter" of your path, perhaps the maximum Euclidean distance between any two points in a path.
I would use a curve-fitting procedure, but also throw in a constant term, i.e. 0 =B0 + B1*X + B2*Y + B3*X*Y + B4*X^2 etc. This would catch the translational variance and then you can do a statistical comparison of the estimated coefficients of the curves formed by the two sets of points as a way of classifying them. I'm assuming you'll have to do bi-variate interpolation if the data form arbitrary curves in the x-y plane.
I am trying to plot large amounts of points using some library. The points are ordered by time and their values can be considered unpredictable.
My problem at the moment is that the sheer number of points makes the library take too long to render. Many of the points are redundant (that is - they are "on" the same line as defined by a function y = ax + b). Is there a way to detect and remove redundant points in order to speed rendering ?
Thank you for your time.
The following is a variation on the Ramer-Douglas-Peucker algorithm for 1.5d graphs:
Compute the line equation between first and last point
Check all other points to find what is the most distant from the line
If the worst point is below the tolerance you want then output a single segment
Otherwise call recursively considering two sub-arrays, using the worst point as splitter
In python this could be
def simplify(pts, eps):
if len(pts) < 3:
return pts
x0, y0 = pts[0]
x1, y1 = pts[-1]
m = float(y1 - y0) / float(x1 - x0)
q = y0 - m*x0
worst_err = -1
worst_index = -1
for i in xrange(1, len(pts) - 1):
x, y = pts[i]
err = abs(m*x + q - y)
if err > worst_err:
worst_err = err
worst_index = i
if worst_err < eps:
return [(x0, y0), (x1, y1)]
else:
first = simplify(pts[:worst_index+1], eps)
second = simplify(pts[worst_index:], eps)
return first + second[1:]
print simplify([(0,0), (10,10), (20,20), (30,30), (50,0)], 0.1)
The output is [(0, 0), (30, 30), (50, 0)].
About python syntax for arrays that may be non obvious:
x[a:b] is the part of array from index a up to index b (excluded)
x[n:] is the array made using elements of x from index n to the end
x[:n] is the array made using first n elements of x
a+b when a and b are arrays means concatenation
x[-1] is the last element of an array
An example of the results of running this implementation on a graph with 100,000 points with increasing values of eps can be seen here.
I came across this question after I had this very idea. Skip redundant points on plots. I believe I came up with a far better and simpler solution and I'm happy to share as my first proposed solution on SO. I've coded it and it works well for me. It also takes into account the screen scale. There may be 100 points in value between those plot points, but if the user has a chart sized small, they won't see them.
So, iterating through your data/plot loop, before you draw/add your next data point, look at the next value ahead and calculate the change in screen scale (or value, but I think screen scale for the above-mentioned reason is better). Now do the same for the next value ahead (getting these values is just a matter of peeking ahead in your array/collection/list/etc adding the for next step increment (probably 1/2) to the current for value whilst in the loop). If the 2 values are the same (or perhaps very minor change, per your own preference), you can skip this one point in your chart by simply adding 'continue' in the loop, skipping adding the data point as the point lies exactly on the slope between the point before and after it.
Using this method, I reduce a chart from 963 points to 427 for example, with absolutely zero visual change.
I think you might need to perhaps read this a couple of times to understand, but it's far simpler than the other best solution mentioned here, much lighter weight, and has zero visual effect on your plot.
I would probably apply a "least squares" algorithm to obtain a line of best fit. You can then go through your points and downfilter consecutive points that lie close to the line. You only need to plot the outliers, and the points that take the curve back to the line of best fit.
Edit: You may not need to employ "least squares"; if your input is expected to hover around "y=ax+b" as you say, then that's already your line of best fit and you can just use that. :)
I have been given some work to do with the fractal visualisation of the Mandelbrot set.
I'm not looking for a complete solution (naturally), I'm asking for help with regard to the orbits of complex numbers.
Say I have a given Complex number derived from a point on the complex plane. I now need to iterate over its orbit sequence and plot points according to whether the orbits increase by orders of magnitude or not.
How do I gather the orbits of a complex number? Any guidance is much appreciated (links etc). Any pointers on Math functions needed to test the orbit sequence e.g. Math.pow()
I'm using Java but that's not particularly relevant here.
Thanks again,
Alex
When you display the Mandelbrot set, you simply translate the real and imaginaty planes into x and y coordinates, respectively.
So, for example the complex number 4.5 + 0.27i translates into x = 4.5, y = 0.27.
The Mandelbrot set is all points where the equation Z = Z² + C never reaches a value where |Z| >= 2, but in practice you include all points where the value doesn't exceed 2 within a specific number of iterations, for example 1000. To get the colorful renderings that you usually see of the set, you assign different colors to points outside the set depending on how fast they reach the limit.
As it's complex numbers, the equation is actually Zr + Zi = (Zr + Zi)² + Cr + Ci. You would divide that into two equations, one for the real plane and one for the imaginary plane, and then it's just plain algebra. C is the coordinate of the point that you want to test, and the initial value of Z is zero.
Here's an image from my multi-threaded Mandelbrot generator :)
Actually the Mandelbrot set is the set of complex numbers for which the iteration converges.
So the only points in the Mandelbrot set are that big boring colour in the middle. and all of the pretty colours you see are doing nothing more than representing the rate at which points near the boundary (but the wrong side) spin off to infinity.
In mathspeak,
M = {c in C : lim (k -> inf) z_k = 0 } where z_0 = c, z_(k+1) = z_k^2 + c
ie pick any complex number c. Now to determine whether it is in the set, repeatedly iterate it z_0 = c, z_(k+1) = z_k^2 + c, and z_k will approach either zero or infinity. If its limit (as k tends to infinity) is zero, then it is in. Otherwise not.
It is possible to prove that once |z_k| > 2, it is not going to converge. This is a good exercise in optimisation: IIRC |Z_k|^2 > 2 is sufficient... either way, squaring up will save you the expensive sqrt() function.
Wolfram Mathworld has a nice site talking about the Mandelbrot set.
A Complex class will be most helpful.
Maybe an example like this will stimulate some thought. I wouldn't recommend using an Applet.
You have to know how to do add, subtract, multiply, divide, and power operations with complex numbers, in addition to functions like sine, cosine, exponential, etc. If you don't know those, I'd start there.
The book that I was taught from was Ruel V. Churchill "Complex Variables".
/d{def}def/u{dup}d[0 -185 u 0 300 u]concat/q 5e-3 d/m{mul}d/z{A u m B u
m}d/r{rlineto}d/X -2 q 1{d/Y -2 q 2{d/A 0 d/B 0 d 64 -1 1{/f exch d/B
A/A z sub X add d B 2 m m Y add d z add 4 gt{exit}if/f 64 d}for f 64 div
setgray X Y moveto 0 q neg u 0 0 q u 0 r r r r fill/Y}for/X}for showpage