Reconstructing a signal from its discrete Fourier transform in R

I am trying to replicate the following figure in R: (adapted from http://link.springer.com/article/10.1007/PL00011669)
The basic concept of the figure is to show the first few components of a DFT, plotted in the time domain, and then show a reconstructed wave in the time domain using only these components (X') relative to the original data (X). I would like to slightly modify the above figure such that all of the lines shown are overlaid on a single plot.
I have been trying to adapt the figure with some real data sampled at 60 Hz. For example:
## 3 second sample where: time is in seconds and var is the variable of interest
temp = data.frame(time=seq(from=0,to=3,by=1/60),
var = c(0.054,0.054,0.054,0.072,0.072,0.072,0.072,0.09,0.09,0.108,0.126,0.126,
0.126,0.126,0.126,0.144,0.144,0.144,0.144,0.144,0.162,0.162,0.144,0.126,
0.126,0.108,0.144,0.162,0.18,0.162,0.126,0.126,0.108,0.108,0.126,0.144,
0.162,0.144,0.144,0.144,0.144,0.162,0.162,0.126,0.108,0.09,0.09,0.072,
0.054,0.054,0.054,0.036,0.036,0.018,0.018,0.018,0.018,0,0.018,0,
0,0,-0.018,0,0,0,-0.018,0,-0.018,-0.018,0,-0.018,
-0.018,-0.018,-0.018,-0.036,-0.036,-0.054,-0.054,-0.072,-0.072,-0.072,-0.072,-0.072,
-0.09,-0.09,-0.108,-0.126,-0.126,-0.126,-0.144,-0.144,-0.144,-0.162,-0.162,-0.18,
-0.162,-0.162,-0.162,-0.162,-0.144,-0.144,-0.144,-0.126,-0.126,-0.108,-0.108,-0.09,
-0.072,-0.054,-0.036,-0.018,0,0,0,0,0.018,0.018,0.036,0.054,
0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.072,0.054,0.072,
0.072,0.072,0.072,0.072,0.072,0.054,0.054,0.054,0.036,0.036,0.036,0.036,
0.036,0.054,0.054,0.072,0.09,0.072,0.036,0.036,0.018,0.018,0.018,0.018,
0.036,0.036,0.036,0.036,0.018,0,-0.018,-0.018,-0.018,-0.018,-0.018,0,
-0.018,-0.036,-0.036,-0.018,-0.018,-0.018,-0.036,0,0,-0.018,-0.018,-0.018,-0.018))
## plot the original data
library(ggplot2)
ggplot(temp, aes(x = time, y = var)) + geom_line()
I believe that I can use fft() to eventually accomplish this goal; however, the leap from the output of fft() to my goal is a bit unclear.
I realize that this question is somewhat similar to: How do I calculate amplitude and phase angle of fft() output from real-valued input? but I am more specifically interested in the actual code for the specific data above.
Please note that I am relatively new to time series analysis so any clarity you could provide w.r.t. putting the output of fft() in context, or any package you could recommend that would accomplish this task efficiently would be appreciated.
Thank you

Matlab is your best tool, and the specific function is just fft(). To use it, first determine several basic parameters of your time domain data:
1. Time duration T, which equals 3 s.
2. Sampling interval T_s, which equals 1/60 s.
3. Frequency domain resolution f_s, which equals the frequency difference between two adjacent Fourier basis functions. You may define f_s according to your needs; however, the smallest possible f_s equals 1/T = 0.333 Hz. As a result, if you want better frequency domain resolution (smaller f_s), you need longer time domain data.
4. Maximum frequency f_M, which equals 1/(2*T_s) = 30 Hz according to the Shannon sampling theorem.
5. DFT length N, which equals 2*f_M/f_s.
Then find out the specific frequencies of the four Fourier basis functions that you want to use to approximate the data, for example 3, 6, 9 and 12 Hz. So f_s = 3 Hz, and N = 2*f_M/f_s = 20.
Your Matlab code looks like this:
var=[0.054,0.054,0.054 ...]; % input all your data points here
f_full=fft(var,20); % Do 20-point fft
f_useful=f_full(2:5); % You are interested in the lowest four frequencies, excluding DC
Here f_useful contains the four complex coefficients of four Fourier basis. To reconstruct var, do the following:
% Generate basis functions
dt=0:1/60:3;
df=[3:3:12];
basis1=exp(1j*2*pi*df(1)*dt);
basis2=exp(1j*2*pi*df(2)*dt);
basis3=exp(1j*2*pi*df(3)*dt);
basis4=exp(1j*2*pi*df(4)*dt);
% Reconstruct var
var_recon=basis1*f_useful(1)+...
basis2*f_useful(2)+...
basis3*f_useful(3)+...
basis4*f_useful(4);
var_recon=real(var_recon);
% Plot both curves
figure;
plot(var);
hold on;
plot(var_recon);
Adapt this code to your paper :)
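If you would rather stay in R, the same idea can be sketched with R's own fft(): keep only the low-frequency bins (plus their conjugate partners) and invert the transform. This is a rough sketch using the temp data frame from the question, not a line-by-line translation of the Matlab code above:
n <- nrow(temp)
X <- fft(temp$var)

keep <- 4                                      # number of non-DC components to keep
Xk <- complex(n)                               # start from an all-zero spectrum
Xk[1] <- X[1]                                  # DC term
Xk[2:(keep + 1)] <- X[2:(keep + 1)]            # low-frequency bins
Xk[(n - keep + 1):n] <- X[(n - keep + 1):n]    # their conjugate partners

## fft(..., inverse = TRUE) is unnormalised in R, hence the division by n
temp$var_recon <- Re(fft(Xk, inverse = TRUE)) / n

## overlay the reconstruction on the original series
library(ggplot2)
ggplot(temp, aes(time, var)) +
  geom_line() +
  geom_line(aes(y = var_recon), colour = "red")
Increasing keep retains more components and brings the reconstruction closer to the original series.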

Adapting my own post from Signal Processing. I think it's still relevant for those in Python.
I am no expert in this topic, but have some useful examples to share.
The more Fourier components you keep, the closer you'll mimic the original signal.
This example shows what happens when you keep 10, 20, ...up to n components. Assuming x and y are your data vectors.
import numpy
from matplotlib import pyplot as plt

n = len(y)
COMPONENTS = [10, 20, n]

for c in COMPONENTS:
    colors = numpy.linspace(start=100, stop=255, num=c)
    for i in range(c):
        Y = numpy.fft.fft(y)
        numpy.put(Y, range(i+1, n), 0.0)   # zero out everything above the i-th bin
        ifft = numpy.fft.ifft(Y)
        plt.plot(x, ifft.real, color=plt.cm.Reds(int(colors[i])), alpha=.70)
    plt.title("First {c} fourier components".format(c=c))
    plt.plot(x, y, label="Original dataset", linewidth=2.0)
    plt.grid(linestyle='dashed')
    plt.legend()
    plt.show()
For the book's dataset, keeping up to 4, 10, and n components:
For your dataset, keeping up to 4, 10, and n components:

Related

DBSCAN Clustering returning single cluster with noise points

I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
Algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' to identify the 'eps' value. Since there are 17 attributes to be considered for clustering, I have taken k=17*2 =34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I use for eps and minPts, the result is the same.
Can anyone tell me where I have gone wrong?
Thanks in advance.
You have two options:
Increase the radius of your center points (given by the epsilon parameter)
Decrease the minimum number of points (minPts) to define a center point.
I would start by decreasing the minPts parameter, since I think it is very high; if DBSCAN cannot find that many points within the radius, it will not group more points into a cluster.
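For instance, a quick sweep over a few minPts values (the numbers are purely illustrative) shows how the number of clusters responds:
library(dbscan)
for (mp in c(5, 10, 17, 34)) {
  db <- dbscan(data1, eps = 4, minPts = mp)
  n_clusters <- sum(unique(db$cluster) != 0)   # cluster 0 is noise
  cat("minPts =", mp, "->", n_clusters, "cluster(s)\n")
}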
A typical problem with using DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method would be to use a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster, while k-means will just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, and you may have very skewed variables that should be transformed first. You could also try non-linear embedding before clustering.
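As a rough illustration of that kind of preprocessing (assuming the variables in this dataset are non-negative and right-skewed, which is typical for the spending columns), you could log-transform before scaling and then rerun DBSCAN; the eps value here is only a placeholder to be re-tuned from the new kNN distance plot:
data2 <- read.csv('CC GENERAL.csv')[, 2:18]
data2 <- na.omit(data2)
data2 <- log1p(as.matrix(data2))   # compress long right tails (needs non-negative values)
data2 <- scale(data2)

library(dbscan)
kNNdistplot(data2, k = 34)         # re-check the knee before picking eps
db2 <- dbscan(data2, eps = 2, minPts = 34)
table(db2$cluster)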

Why the need for a mask when performing Fast Fourier Transform?

I'm trying to find the peak frequencies hidden in my data using the fft() method in R. While preparing the data, a more experienced user recommended creating a "mask" (more on that below), which does give me the exact diagram I'm looking for. The problem is, I don't understand what it does or why it's needed.
To give some context, I'm working with .txt files with around 12000 entries each. It's voltage vs. time information, and the expected result is just a sinusoidal wave with a clear peak frequency that should be close to 1-2 Hz. This is an example of what one of those files looks like:
I've been trying to use the Fast Fourier Transform method fft() implemented in R to find the peak frequencies and get a diagram that reflected them clearly. At first, I calculate some things that I understand are going to be useful, like the Nyquist frequency and the range of frequencies I'll show in the final graph:
n = length(variable)
dt = time[5]-time[4]
df = 1/(max(time)) #Find out the "unit" frequency
fnyquist = 1/(2*dt) #The Nyquist frequency
f = seq(-fnyquist, fnyquist-df, by=df) #These are the frequencies I'll plot
But when I plot the absolute value of what fft(data) calculates vs. the range of frequencies, I get this:
The peak frequency seems to be close to 50 Hz, but I know that's not the case. It should be close to 1 Hz. I'm a complete newbie in R and in Fourier analysis, so after researching a little, I found on a Swiss page that this can be solved by creating a "mask", which is actually just a vector with a repeating pattern (1, -1, 1, -1, ...) of the same length as my data vector:
mask=rep(c(1, -1),length.out=n)
Then if I multiply my data vector by this mask and plot the results:
results = mask*data
plot(f,abs(fft(results)),type="h")
I get what I was looking for. (This is the graph after limiting the x-axis to a reasonable scale).
So, what's the mask actually doing? I understand it's changing my data point signs in an alternating manner, but I don't get why it would take the inferred peak frequencies from ~50 Hz to the correct result of ~1 Hz.
Thanks in advance!
Your "mask" is one of two methods of performing an fftshift, which is commonly done to center the 0 Hz output of an FFT in the middle of a graph or plot (instead of at the left edge, with the negative frequencies wrapping around to the right edge).
To perform an fftshift, you can heterodyne or modulate your data (by Fs/2) before the FFT, or simply do a circular shift by 50% after the FFT. Both produce the same result; they are the same due to the shift property of the DFT.
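A small numeric check of that equivalence in R (any even-length vector will do):
set.seed(1)
x <- rnorm(8)
N <- length(x)

## Method 1: modulate by Fs/2 before the FFT (the "mask")
X1 <- fft(x * rep(c(1, -1), length.out = N))

## Method 2: circularly shift the FFT output by N/2 afterwards
X2 <- fft(x)
X2 <- c(X2[(N/2 + 1):N], X2[1:(N/2)])

max(abs(X1 - X2))   # effectively zero (floating-point noise)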

Dimensions of fractals: box counting, Hausdorff, packing in R^n space

I would like to calculate the dimensions of a fractal written as an n-dimensional array of 0s and 1s. This includes the box counting, Hausdorff and packing dimensions.
I only have an idea of how to code the box counting dimension: just count the 1's in the n-dimensional matrix and then use this formula:
boxing_count=-log(v)/log(n);
where v is the number of 1's and n is the space dimension (R^n).
This approach simulates counting minimal-resolution boxes 1 x 1 x ... x 1, so numerically it is like the limit eps -> 0. What do you think about this solution?
Do you have any idea (or maybe code) for calculating the Hausdorff or packing dimension?
The Hausdorff and packing dimension are purely mathematical tools based in measure theory. They have wonderful properties in that context but are not well suited for experimentation. In short, there is no reason to expect that you can estimate their values based on a single matrix approximation to some set.
Box counting dimension, by contrast, is well suited for numerical investigation. Specifically, let N(e) denote the number of squares of side length e required to cover your fractal set. As you seem to know, the box counting dimension of your set is the limit as e->0 of
log(N(e))/log(1/e)
However, I don't think that just choosing the smallest available value of e is generally a good idea. The standard interpretation in the physics literature, as I understand it, is to presume that the relationship between N(e) and e should hold over a broad range of values. A standard way to compute the box-counting dimension is to compute N(e) for a few choices of e taken from a sequence that tends geometrically to zero, and then fit a line to the points in a log-log plot of N(e) versus 1/e. The box-counting dimension should be approximately the slope of that line.
Example
As a concrete example, the following Python code generates a binary matrix that describes a fractal structure.
import numpy as np

size = 1024
first_row = np.zeros(size, dtype=int)
first_row[int(size/2)-1] = 1
rows = np.zeros((int(size/2), size), dtype=int)
rows[0] = first_row
for i in range(1, int(size/2)):
    rows[i] = (np.roll(rows[i-1], -1) + rows[i-1] + np.roll(rows[i-1], 1)) % 2
m = int(np.log(size)/np.log(2))
rows = rows[0:2**(m-1), 0:2**m]
We can view the fractal structure by simply interpreting each 1 as a black pixel and each zero as white pixel.
import matplotlib.pyplot as plt
plt.matshow(rows, cmap = plt.cm.binary)
This matrix makes a nice test since it can be shown that there is an actual limiting object whose fractal dimension is log(1+sqrt(5))/log(2) or approximately 1.694, yet it's complicated enough to make the box counting estimate a little tricky.
Now, this matrix is 512 rows by 1024 columns; it decomposes naturally into 2 matrices that are 512 by 512. Each of those decomposes naturally into 4 matrices that are 256 by 256, etc. For each such decomposition, we need to count the number of sub matrices that have at least one non-zero element. We can perform this analysis as follows:
cnts = []
for lev in range(m):
    block_size = 2**lev
    cnt = 0
    for j in range(int(size/(2*block_size))):
        for i in range(int(size/block_size)):
            cnt = cnt + rows[j*block_size:(j+1)*block_size, i*block_size:(i+1)*block_size].any()
    cnts.append(cnt)

data = np.array([(2**(m-(k+1)), cnts[k]) for k in range(m)])
data
# Out:
# array([[ 512, 45568],
# [ 256, 22784],
# [ 128, 7040],
# [ 64, 2176],
# [ 32, 672],
# [ 16, 208],
# [ 8, 64],
# [ 4, 20],
# [ 2, 6],
# [ 1, 2]])
Now, your idea is to simply compute log(45568)/log(512) or approximately 1.7195, which is not too bad. I'm recommending that we examine a log-log plot of this data.
xs = np.log(data[:,0])
ys = np.log(data[:,1])
plt.plot(xs,ys, 'o')
This indeed looks close to linear, indicating that we might expect our box-counting technique to work reasonably well. First, though, it might be reasonable to exclude the one point that appears to be an outlier. In fact, that's one of the desirable characteristics of this approach. Here's how to do so:
plt.plot(xs,ys, 'o')
xs = xs[1:]
ys = ys[1:]
A = np.vstack([xs, np.ones(len(xs))]).T
m,b = np.linalg.lstsq(A, ys)[0]
def line(x): return m*x+b
ys = line(xs)
plt.plot(xs,ys)
m
# Out: 1.6902585379630133
Well, the result looks pretty good. In particular, this is a definitive example that this approach can work better than the simple idea of using just one data point. In fairness, though, it's not hard to find examples where the simple approach works better. Also, this set is regular enough that we get some nice results. Generally, one can't really expect box-counting computations to be too reliable.

Why do I get two frequency spikes from a simple sin function via FFT in R?

I learned about Fourier transforms in my mathematics classes and thought I had understood them. Now I am trying to play around with R (the statistical language) and interpret the results of a discrete FFT in practice. This is what I have done:
x = seq(0,1,by=0.1)
y = sin(2*pi*(x))
calcenergy <- function(x) Im(x) * Im(x) + Re(x) * Re(x)
fy <- fft(y)
plot(x, calcenergy(fy))
and get this plot:
If I understand this right, this represents 'half' of the energy density spectrum. As the transformation is symmetric, I could just mirror all values to the negative values of x to get the full spectrum.
However, what I don't understand is why I am getting two spikes. There is only a single sine frequency in here. Is this an aliasing effect?
Also, I have no clue how to get the frequencies out of this plot. Let's assume the units of the sine function were seconds; is the peak at 1.0 in the density spectrum 1 Hz then?
Again: I understand the theory behind the FFT; the practical application is the problem :).
Thanks for any help!
For a purely real input signal of N points you get a complex output of N points with complex conjugate symmetry about N/2. You can ignore the output points above N/2, since they provide no useful additional information for a real input signal, but if you do plot them you will see the aforementioned symmetry, and for a single sine wave you will see peaks at bins n and N - n. (Note: you can think of the upper N/2 bins as representing negative frequencies.) In summary, for a real input signal of N points, you get N/2 useful complex output bins from the FFT, which represent frequencies from DC (0 Hz) to Nyquist (Fs / 2).
To get frequencies from the result of an FFT you need to know the sample rate of the data that was input to the FFT and the length of the FFT. The center frequency of each bin is the bin index times the sample rate divided by the length of the FFT. Thus you will get frequencies from DC (0 Hz) to Fs/2 at the halfway bin.
The second half of the FFT results are just complex conjugates of the first for real data inputs. The reason is that the imaginary portions of complex conjugates cancel, which is required to represent a summed result with zero imaginary content, e.g. strictly real.
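A minimal R illustration of that bin-to-frequency mapping, using a made-up 1 Hz sine sampled at 100 Hz (not your exact vectors):
fs <- 100
t  <- seq(0, 1 - 1/fs, by = 1/fs)    # N = 100 samples, one full period
y  <- sin(2*pi*1*t)

Y    <- fft(y)
N    <- length(y)
freq <- (0:(N - 1)) * fs / N         # centre frequency of each bin
keep <- 1:(N/2)                      # DC up to (just below) Nyquist

plot(freq[keep], Mod(Y)[keep], type = "h",
     xlab = "Frequency (Hz)", ylab = "|FFT|")
freq[which.max(Mod(Y)[keep])]        # peaks at 1 Hz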

How do I calculate the "difference" between two sequences of points?

I have two sequences of length n and m. Each is a sequence of points of the form (x,y) and represents a curve in an image. I need to find how different (or similar) these sequences are, given that:
one sequence is likely longer than the other (i.e., one can be half or a quarter as long as the other, but if they trace approximately the same curve, they are the same)
these sequences could be in opposite directions (i.e., sequence 1 goes from left to right, while sequence 2 goes from right to left)
I looked into some difference estimates like Levenshtein as well as edit-distances in structural similarity matching for protein folding, but none of them seem to do the trick. I could write my own brute-force method but I want to know if there is a better way.
Thanks.
Do you mean that you are trying to match curves that have been translated in x,y coordinates? One technique from image processing is to use chain codes [I'm looking for a decent reference, but all I can find right now is this] to encode each sequence and then compare those chain codes. You could take the sum of the differences (modulo 8) and if the result is 0, the curves are identical. Since the sequences are of different lengths and don't necessarily start at the same relative location, you would have to shift one sequence and do this again and again, but you only have to create the chain codes once. The only way to detect if one of the sequences is reversed is to try both the forward and reverse of one of the sequences. If the curves aren't exactly alike, the sum will be greater than zero but it is not straightforward to tell how different the curves are simply from the sum.
This method will not be rotationally invariant. If you need a method that is rotationally invariant, you should look at Boundary-Centered Polar Encoding. I can't find a free reference for that, but if you need me to describe it, let me know.
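For illustration only, here is a very small R sketch of the chain-code idea. It assumes curve1 and curve2 are hypothetical two-column matrices of (x, y) points with no repeated consecutive points, already resampled to the same length; real image curves would typically need that resampling and the shifting described above first.
## map each step between consecutive points to one of 8 directions (0-7)
chain_code <- function(pts) {
  dirs <- rbind(c(1, 0), c(1, 1), c(0, 1), c(-1, 1),
                c(-1, 0), c(-1, -1), c(0, -1), c(1, -1))   # E, NE, N, NW, W, SW, S, SE
  steps <- sign(diff(pts))                                 # one row per step
  apply(steps, 1, function(s)
    which(dirs[, 1] == s[1] & dirs[, 2] == s[2]) - 1)
}

c1 <- chain_code(curve1)
c2 <- chain_code(curve2)
sum((c1 - c2) %% 8)   # 0 means the (aligned) chain codes match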
A method along these lines might work:
For both sequences:
Fit a curve through the sequence. Make sure that you have a continuous one-to-one function from [0,1] to points on this curve. That is, for each (real) number between 0 and 1, this function returns a point on the curve belonging to it. By tracing the function for all numbers from 0 to 1, you get the entire curve.
One way to fit a curve would be to draw a straight line between each pair of consecutive points (it is not a nice curve, because it has sharp bends, but it might be fine for your purpose). In that case, the function can be obtained by calculating the total length of all the line segments (Pythagoras). The point on the curve corresponding to a number Y (between 0 and 1) corresponds to the point on the curve that has a distance Y * (total length of all line segments) from the first point on the sequence, measured by traveling over the line segments (!!).
Now, after we have obtained such a function F(double) for the first sequence, and G(double) for the second sequence, we can calculate the similarity as follows:
double epsilon = 0.01;
double curveDistanceSquared = 0.0;
for (double d = 0.0; d < 1.0; d = d + epsilon)
{
    Point pointOnCurve1 = F(d);
    Point pointOnCurve2 = G(d);
    // alternatively, use G(1.0-d) to check whether the second sequence is reversed
    double distanceOfPoints = pointOnCurve1.EuclideanDistance(pointOnCurve2);
    curveDistanceSquared = curveDistanceSquared + distanceOfPoints * distanceOfPoints;
}
similarity = 1.0 / curveDistanceSquared;
Possible improvements:
- Find an improved way to fit the curves. Note that you still need the function that traces the curve for the above method to work.
- When calculating the distance, consider reparametrizing the function G in such a way that the distance is minimized. This means you have an increasing function R, such that R(0) = 0 and R(1) = 1, but which is otherwise general. When calculating the distance you use
Point pointOnCurve1 = F(d);
Point pointOnCurve2 = G(R(d));
Subsequently, you try to choose R in such a way that the distance is minimized. (To see what happens, note that G(R(d)) also traces the curve.)
Why not do some sort of curve fitting procedure (least-squares, whether ordinary or non-linear) and see if the coefficients on the shape parameters are the same? If you run it as a panel-data sort of model, there are explicit statistical tests of whether sets of parameters are significantly different from one another. That would also solve the problem of the same curve being sampled at different resolutions.
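A rough R sketch of that idea, assuming both sequences can be expressed as y = f(x) and that curve1 and curve2 are hypothetical data frames with x and y columns:
## fit the same polynomial to both curves and eyeball the coefficients
fit1 <- lm(y ~ poly(x, 3, raw = TRUE), data = curve1)
fit2 <- lm(y ~ poly(x, 3, raw = TRUE), data = curve2)
cbind(coef(fit1), coef(fit2))

## pool the data with a group indicator and test whether the coefficients
## differ between the two curves (an F-test on the interaction terms)
both    <- rbind(transform(curve1, g = 0), transform(curve2, g = 1))
reduced <- lm(y ~ poly(x, 3, raw = TRUE), data = both)
full    <- lm(y ~ poly(x, 3, raw = TRUE) * g, data = both)
anova(reduced, full)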
Step 1: Canonicalize the orientation. For example, let's say that all curves start at the endpoint with the lowest lexicographic order.
def inCanonicalOrientation(path):
    return path if path[0] < path[-1] else path[::-1]
Step 2: You can either be roughly accurate, or very accurate. If you wish to be very accurate, calculate a spline, or fit both curves to a polynomial of appropriate degree, and compare coefficients. If you'd like just a rough estimate, do as follows:
def resample(path, numPoints):
    pathLen = pathLength(path)  # write this function
    segments = generateSegments(path)
    currentSegment = next(segments)
    segmentsSoFar = [currentSegment]
    for i in range(numPoints):
        samplePosition = i/(numPoints-1)*pathLen
        while samplePosition > pathLength(segmentsSoFar) + currentSegment.length:
            currentSegment = next(segments)
            segmentsSoFar.append(currentSegment)
        difference = samplePosition - pathLength(segmentsSoFar)
        howFar = difference/currentSegment.length
        yield Point((1-howFar)*currentSegment.start + (howFar)*currentSegment.end)
This can be modified from a linear resampling to something better.
def error(pathA, pathB):
    pathA = inCanonicalOrientation(pathA)
    pathB = inCanonicalOrientation(pathB)

    higherResolution = max([len(pathA), len(pathB)])
    resampledA = resample(pathA, higherResolution)
    resampledB = resample(pathB, higherResolution)

    error = sum(
        abs(pointInA-pointInB)
        for pointInA, pointInB in zip(resampledA, resampledB)
        )
    averageError = error / higherResolution
    normalizedError = error / Z(AorB)
    return normalizedError
Where Z is something like the "diameter" of your path, perhaps the maximum Euclidean distance between any two points in a path.
I would use a curve-fitting procedure, but also throw in a constant term, i.e. 0 = B0 + B1*X + B2*Y + B3*X*Y + B4*X^2, etc. This would catch the translational variance, and then you can do a statistical comparison of the estimated coefficients of the curves formed by the two sets of points as a way of classifying them. I'm assuming you'll have to do bivariate interpolation if the data form arbitrary curves in the x-y plane.
