What does symmetry mean when plotting Real vs Imaginary Components of FFT of a Periodic Time Series

As the subject says, what does the symmetry mean when graphing the real vs. imaginary components of an FFT? And does the clustering imply periodicity in the data?
I ask because I did a project predicting sunspot counts with a neural network and had to find the periodicity of the data (and used an FFT, which worked).
Someone recommended I look at graphing the real vs. imaginary components, but I don't understand what I am looking at.

The Fourier transform of any real-valued signal has Hermitian symmetry, meaning the transform values at positive and negative frequencies are complex conjugates of each other. Therefore the real parts are the same, and the imaginary parts are negatives of each other, as your picture shows.
It would probably be more interesting to drop the negative frequencies and do your graph again.
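For concreteness, here is a minimal numpy sketch (the signal is synthetic): it checks the conjugate symmetry X[k] == conj(X[n-k]) for a real input, and shows that np.fft.rfft simply drops the redundant negative frequencies.

    import numpy as np

    # Synthetic real-valued signal: one dominant cycle plus noise.
    rng = np.random.default_rng(0)
    n = 256
    t = np.arange(n)
    x = np.sin(2 * np.pi * t / 32) + 0.1 * rng.standard_normal(n)

    X = np.fft.fft(x)

    # Hermitian symmetry: X[k] == conj(X[n - k]) for a real input.
    k = np.arange(1, n)
    print(np.allclose(X[k], np.conj(X[n - k])))  # True

    # rfft keeps only the non-negative frequencies.
    Xr = np.fft.rfft(x)
    print(np.allclose(Xr, X[: n // 2 + 1]))      # True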
For your second question, your result appears to be clustered around 0,0, so no, the clustering does not imply periodicity. Large values in the transform imply periodicity, at the related frequency.
However, you have two large components, one primarily real and one primarily imaginary. Another way of thinking of "real in the frequency domain" is "like a cosine in the time domain", while "imaginary in the frequency domain" is "like a sine in the time domain." Your data set probably doesn't start exactly on a sunspot cycle, so the cycle looks like the combination of a sine and cosine. If you slide the data set, the relative amplitudes of the real and imaginary parts will probably change.
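That last claim is easy to check numerically. A quick sketch with a synthetic cosine (the period and the quarter-period shift are chosen for illustration): sliding the series moves the energy of the base-frequency bin from the real part to the imaginary part.

    import numpy as np

    n = 256
    t = np.arange(n)
    x = np.cos(2 * np.pi * t / 32)       # starts exactly on a cycle: "cosine-like"
    k = n // 32                          # index of the base-frequency bin

    print(np.fft.fft(x)[k])              # essentially purely real (128+0j)

    # Slide the series by a quarter period (8 samples): the same energy
    # now shows up in the imaginary part of that bin instead.
    print(np.fft.fft(np.roll(x, 8))[k])  # essentially purely imaginary (-128j)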
I had earlier suggested that the phase difference might imply differing activity in summer and winter, but that would show up as a component at twice the base frequency.

Related

COMSOL: Diffusion in Transport of Diluted Species Produces Unphysical Results

I am simulating Transport of Diluted Species inside a pipe segment in COMSOL Multiphysics. I have specified an initial concentration, which produces a concentration distribution around a slice through the pipe at t = 0. Moreover, I have a point probe a little bit upstream (I am using laminar flow for convection), and I am plotting the concentration at this point as a function of time.
To investigate whether the model produces accurate (i.e., physically realistic) results, I am varying the diffusion coefficient D. This is where I noticed unrealistic behavior: over a large range of diffusion coefficients, the concentration graph at the point probe does not change. This is unphysical, since, e.g., higher diffusion coefficients should lead to a more spread-out distribution at the point probe.
I already did a mesh refinement study and found that the result strongly depends on mesh resolution. Therefore, I am now using the highest mesh resolution (extremely fine). Regardless, the concentration results still do not change for varying diffusion coefficients.
What could be the reason for this unphysical behavior? I already know it is not due to mesh resolution or relative tolerance of the solver.
After a lot of time spent on this simulation, I concluded that the undesired effects are indeed due to numerical diffusion, as suggested by 2b-t. Of course, it is impossible to be certain that this is actually the reason; however, I investigated pretty much every other potential culprit in the simulation, without any new insights.
To work around this issue of numerical diffusion, I switched to Particle-Based Simulation (PBS) and approximated the concentration as the normalized number of particles inside a small receiver volume. This method provides a good approximation of the concentration for large particle numbers and a small receiver volume.
By doing this, I produced results that are in very good agreement with results known from the literature.
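The idea behind that approximation can be shown with a minimal Monte Carlo sketch; everything here (the uniform particle cloud, the probe location, the counts) is an illustrative assumption, not the actual COMSOL setup.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical particle cloud in a unit cube (stand-in for advected particles).
    n_particles = 1_000_000
    pos = rng.uniform(0.0, 1.0, size=(n_particles, 3))

    # Small spherical receiver volume around a probe point.
    probe = np.array([0.5, 0.5, 0.5])
    r = 0.05
    inside = np.sum(np.sum((pos - probe) ** 2, axis=1) < r**2)

    # Normalized particle count per unit volume approximates the concentration.
    v_receiver = 4.0 / 3.0 * np.pi * r**3
    concentration = inside / n_particles / v_receiver
    print(concentration)  # ~1.0 for a uniform unit-density cloud

As the answer notes, the estimate improves with more particles and degrades if the receiver volume is made too small relative to the particle count.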

How to sort many time series by how trending each series is

Hi, I am recording data for around 150k items in InfluxDB. I have tried grouping by item id and using some of the functions from the docs, but they don't seem to show "trend".
As there are a lot of series to group by, I am currently performing a query on each series to calculate a value, storing it, and sorting by that.
I have tried linear regression (the average angle of the line), but it's not quite meant for this, as the X-axis values are timestamps, which do not correlate to the Y-axis values, so I end up with a near-vertical line. Maybe I can map the X values to something else?
The other issue I have is that some series have much higher values than others, so one series jumping up by 1000 might be huge (very trending), while it is not a big deal for other series that are always much higher.
Is there a way I can generate a single value from a series that represents how trending the series is, e.g., it has just jumped up quite a lot compared to normal?
Here is an example of one series that is not trending and one that was trending a couple of days ago. The latter would have a higher trend value than the first:
Thanks!
I think similar problems arise naturally in the stock market and, in general, when detecting outliers.
There are different ways to proceed. Option 1 is probably good enough.
1. It looks like you have a moving average in the graphs. You could take the difference from the moving average and inspect its distribution to choose appropriate thresholds for when to pay attention (see the sketch after this list). It looks like the first graph has a possibly relevant event. You could place a threshold at, say, two standard deviations of the difference between the real series and the moving average.
2. De-trend each series. Even option 1 could be good enough (I mean just subtracting the average of the last X days from the series' current value), but you could de-trend using more sophisticated ideas. That may need more attention per case; for instance, you should be careful with seasonality and so on. Perhaps something like the Hodrick-Prescott filter, or in line with this: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/.
3. The idea from option 1 is perhaps more formally described as Bollinger Bands, which help you know where the time series should be with some probability.
There are more sophisticated ways to identify outliers in time series (as in here: https://towardsdatascience.com/effective-approaches-for-time-series-anomaly-detection-9485b40077f1), or see here for a literature review: https://arxiv.org/pdf/2002.04236.pdf
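As a concrete starting point, here is a minimal sketch of option 1 (the function name and window length are illustrative, not from the question): score each series by how far its latest value sits from its own recent average, in units of that series' own typical deviation. Because the score is dimensionless, it also addresses the different-scales issue.

    import numpy as np

    def trend_score(values, window=144):
        """How unusual is the latest value relative to the series' own
        recent average? Returned in units of standard deviations, so
        scores are comparable across series with very different scales."""
        values = np.asarray(values, dtype=float)
        recent = values[-window:]
        baseline = recent.mean()
        sigma = recent.std()
        if sigma == 0:
            return 0.0
        return (values[-1] - baseline) / sigma

    # Sort the ~150k series by score and inspect the top of the list:
    # scores = {item_id: trend_score(series) for item_id, series in all_series.items()}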

Is there a numerical method for approaching the first derivative at t = 0 s in a real-time application?

I want to refine, step by step as unevenly sampled data arrive, the estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity of a projectile, but you do not know its final position and velocity; you are only (slowly) receiving measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, data have almost the same sampling frequency, i.e., it is about 15 min. Thus, there are some measurements without changes, because of the nature of the phenomenon (heat transfer). It gives an exponential tendency and I can fit data to a known model, but an important amount of information is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progressive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But it was using preprocessed data (i.e., jitter removal, smoothing, and fitting using the theoretical functional). It gave me the following result:
This is a real example of the problem and its "current solution"
It is good for me, but I would like to know if there is an optimal manner of doing that, but employing the raw data (or smoothed data).
IMO, additional data is not relevant for improving the estimate at zero, because perturbations come into play and the correlation between the first and later samples keeps decreasing.
Also, the asymptotic behavior of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can introduce a bias in the measurements.
I would stick to the first points (say up to t=20) and fit a simple model, say quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
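A minimal sketch of the suggestion above (the cutoff t <= 20 and the synthetic data are assumptions for illustration): fit a quadratic to the earliest points by least squares and read the initial slope off the linear coefficient.

    import numpy as np

    def initial_slope(t, y, t_max=20.0):
        """Estimate dy/dt at t = 0 by fitting y ~ a*t^2 + b*t + c to the
        earliest (possibly unevenly sampled) points; the slope at 0 is b."""
        mask = t <= t_max
        a, b, c = np.polyfit(t[mask], y[mask], deg=2)
        return b

    # Synthetic exponential data, y = 1 - exp(-t/30), sampled unevenly:
    t = np.sort(np.random.default_rng(2).uniform(0, 60, size=40))
    y = 1 - np.exp(-t / 30)
    print(initial_slope(t, y))  # close to the true initial slope 1/30 ≈ 0.033

For robust fitting in the presence of outliers, the same quadratic model can instead be fit with, e.g., scipy.optimize.least_squares using a soft_l1 loss.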

Simple algorithm for online outlier detection of a generic time series

I am working with a large amount of time series.
These time series are basically network measurements coming in every 10 minutes; some of them are periodic (e.g., the bandwidth), while others aren't (e.g., the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average in order to remove some noise, but then what next? Simple things like standard deviation, MAD, ... against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample "value".
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise-cancelling headphones. You have a filter that constantly adapts to the input signal, effectively matching its coefficients to a hypothetical short-term model of the signal source, thereby minimizing the mean-square prediction error. The output (the residual error) then stays small except when you get an outlier, which shows up as a spike that is easy to detect with a threshold. Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
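Here is a minimal sketch of that technique (the filter length, step size, and threshold rule are all assumptions to tune): a normalized LMS filter predicts each new sample from the previous few, and a prediction residual far beyond its running scale is flagged.

    import numpy as np

    def lms_outliers(x, n_taps=8, mu=0.5, k=5.0):
        """Flag indices where the one-step prediction error of a
        normalized LMS filter spikes well above its running scale."""
        w = np.zeros(n_taps)                 # adaptive filter coefficients
        scale = np.std(x[:n_taps]) + 1e-9    # running estimate of typical |error|
        flagged = []
        for i in range(n_taps, len(x)):
            window = x[i - n_taps:i][::-1]   # most recent sample first
            err = x[i] - w @ window          # prediction residual
            if abs(err) > k * scale:
                flagged.append(i)            # residual spike: likely outlier
            else:
                # normalized LMS update; adapt only on "normal" samples
                w += mu * err * window / (window @ window + 1e-9)
            scale = 0.99 * scale + 0.01 * abs(err)
        return flagged

    # Example: a slow periodic series with one injected spike at index 300.
    t = np.arange(1000)
    x = np.sin(2 * np.pi * t / 144) + 0.05 * np.random.default_rng(3).standard_normal(1000)
    x[300] += 3.0
    print(lms_outliers(x))  # index 300 should be among the flagged samples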
I suggest the scheme below, which should be implementable in a day or so:
Training:
- Collect as many samples as you can hold in memory.
- Remove obvious outliers using the standard deviation of each attribute.
- Calculate and store the correlation matrix and the mean of each attribute.
- Calculate and store the Mahalanobis distances of all your samples.
Calculating "outlierness" for a single sample:
- Retrieve the means, correlation matrix, and Mahalanobis distances from training.
- Calculate the Mahalanobis distance "d" for your sample.
- Return the percentile in which "d" falls (using the Mahalanobis distances from training).
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
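A minimal sketch of the whole scheme (the data is synthetic; following the PS, the distance is computed on standardized attributes, i.e. with the correlation matrix):

    import numpy as np

    class MahalanobisScorer:
        """Train on a sample matrix (rows = samples, columns = attributes),
        then score a new sample by the percentile of its Mahalanobis
        distance among the training distances."""

        def fit(self, X):
            self.mean = X.mean(axis=0)
            self.std = X.std(axis=0)
            corr = np.corrcoef(X, rowvar=False)       # correlation matrix
            self.corr_inv = np.linalg.inv(corr)
            self.train_d = np.array([self._dist(row) for row in X])
            return self

        def _dist(self, x):
            z = (x - self.mean) / self.std            # standardize attributes
            return float(np.sqrt(z @ self.corr_inv @ z))

        def score(self, x):
            """Outlier score in [0, 100]; 100 is an extreme outlier."""
            return 100.0 * np.mean(self.train_d <= self._dist(x))

    # Example with correlated Gaussian training data:
    rng = np.random.default_rng(4)
    X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
    scorer = MahalanobisScorer().fit(X)
    print(scorer.score(np.array([0.1, 0.0])))   # low: a typical point
    print(scorer.score(np.array([3.0, -3.0])))  # ~100: goes against the correlation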

Correct way to standardize/scale/normalize multiple variables following power law distribution for use in linear combination

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs. 0-35000, and follow a power-law distribution (or at least definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
1. Standardizing the variables (subtract the mean and divide by the standard deviation). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
2. Re-scaling variables to the range [0,1] by subtracting min(variable) and dividing by the range (max(variable) - min(variable)). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.
3. Equalizing the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it is in the 0-10 percentile, the 10-20 percentile, ..., or the 90-100 percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
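A minimal sketch of that decile idea (scipy is assumed available; the two sample variables are synthetic stand-ins for in_degree and betweenness_centrality):

    import numpy as np
    from scipy import stats

    def to_deciles(x):
        """Replace each value by its decile (1..10) within the variable's
        own empirical distribution; uniform by construction."""
        pct = stats.rankdata(x) / len(x) * 100          # empirical percentile
        return np.ceil(pct / 10).astype(int).clip(1, 10)

    rng = np.random.default_rng(5)
    in_degree = rng.zipf(2.0, size=1000).clip(max=15)    # heavy-tailed, 0-15 scale
    betweenness_centrality = rng.pareto(1.5, size=1000) * 35000

    informal_power_index = to_deciles(in_degree) + to_deciles(betweenness_centrality)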
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values:
((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
Very interesting question. Could something like this work?
Let's assume that we want to scale both variables to the range [-1, 1].
Take the example of betweenness_centrality, which has a range of 0-35000.
1. Choose a large number on the order of the range of the variable; as an example, let's choose 25,000.
2. Create 25,000 bins over the original range [0, 35000] and 25,000 bins over the new range [-1, 1].
3. For each number x_i, find the bin it falls into in the original range. Let this be B_i.
4. Find the corresponding bin B_i in the new range [-1, 1].
5. Use either the max or the min of bin B_i in [-1, 1] as the scaled version of x_i.
This preserves the power-law distribution while also scaling it down to [-1, 1], and it does not have the problem experienced by (x - mean)/sd.
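A minimal numpy sketch of those steps (the bin count of 25,000 is taken from the description above). Note that with equal-width bins this is effectively a quantized min-max map to [-1, 1], which is exactly why it preserves the distribution shape:

    import numpy as np

    def bin_rescale(x, lo=0.0, hi=35000.0, n_bins=25000):
        """Map values from [lo, hi] to [-1, 1] via n_bins equal-width bins."""
        edges = np.linspace(lo, hi, n_bins + 1)
        b = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)  # bin index of each value
        new_edges = np.linspace(-1.0, 1.0, n_bins + 1)
        return new_edges[b + 1]       # use the upper edge of the matching bin

    x = np.array([0.0, 17500.0, 35000.0])
    print(bin_rescale(x))             # approximately [-1, 0, 1]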
Normalizing to [0, 1] would be my short-answer recommendation for combining the two values, as it will maintain the distribution shape, as you mentioned, and should solve the problem of combining the values.
If the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after: a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in the given distribution a value lies. This could be done many ways, one of which is to determine how many standard deviations away from the mean the given value is; you could then combine these two values in some way to get your index (addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.
