Understanding TSA::periodogram() - r

I have some data sampled at regular intervals that looks sinusoidal and I would like to determine the frequency of the wave, to that end I obtained R and loaded the TSA package that contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data - I do see a sharp spike but the maximum 'spec' value has a frequency of 0.007407407, how does this relate to my frequency of 80Hz?
I note that there is variable foo$bandwidth with a value of 0.001069167 which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.

The periodogram is computed from the time series without knowledge of your actual sampling interval. This result in frequencies which are limited to the normalized [0,0.5] range. To obtain a frequency in Hertz that takes into account the sampling interval, you simply need to multiply by the sampling rate. In your case, the spike you get at a normalized frequency of 0.007407407 and a sampling rate of 10,000Hz, this correspond to a frequency of ~74Hz.
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively 'negligeable'. The foo$bandwidth variable then indicates that the input signal starts to contain less energy for frequencies above 0.001069167*10000Hz ~ 107Hz, which is consistent with the tone's decay.

Related

How to eliminate zeros in simulated data from rnorm function

I have a large set of high frequency data of wind. I use this data in a model to calculate gas exchange between atmosphere and water. I am using the average wind of a 10-day series of measurements to represent gas exchange at a given time. Since the wind is an average value from a 10-day series I want to apply the error to the output by adding the error to the input:
#fictional time series, manually created by me.
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
I then create 100 values around the mean and sd of the wind vector:
df <- as.data.frame(mapply(rnorm,mean=mean(wind),sd=sd(wind),n=100))
The standard deviation generates negative values. If these are run in the gas exchange model I get disproportionately large error simply because wind speed can't be negative and the model is not constructed to be capable to run with negative wind measurements. I have been suggested to log transform the raw data and run the rnorm() with logged values, and then transform back. But since there are several zeros in the data (0=no wind) I can't simply log the values. Hence I use the log(x+c) method:
wind.log <- log(wind+1)
df.log <- as.data.frame(mapply(rnorm,
mean=mean(wind.log),
sd=sd(wind.log),n=100))
However, I will need to convert values back to actual wind measurements before running them in the model.
This is where it gets problematic, since I will need to use exp(x)-c to convert values back and then I end up with negative values again.
Is there a way to work around this without truncating the 0's and screwing up the generated distribution around the mean?
My only alternative is otherwise is to calculate gas exchange directly at every given time point and generate a distribution from that, those values would never be negative or = 0 and can hence be log-transformed.
Suggestion: use a zero-inflated/altered model, where you generate some proportion of zero values and draw the rest from a log-normal distribution(to make sure you don't get negative values):
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
prop_nonzero <- mean(wind>0)
lmean <- mean(log(wind[wind>0]))
lsd <- sd(log(wind[wind>0]))
n <- 500
vals <- rbinom(n, size=1,prob=prop_nonzero)*rlnorm(n,meanlog=lmean,sdlog=lsd)
Alternatively you could use a Tweedie distribution (as suggested by #aosmith), or fit a censored model to estimate the distribution of wind values that get measured as zero (assuming that the wind speed is never exactly zero, just too small to measure)

Poorly fitting curve in natural log regression

I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
output$curvePlot <- renderPlot ({
x=medianX
y=medianY
Estimate = lad(formula = y~log(x),method = "EM")
logEstimate = lad(formula = y~log(x),method = "EM")
plot(x,predict(Estimate),type='l',col='white')
lines(x,predict(logEstimate),col='red')
points(x,y)
cf <- round(coef(logEstimate),1)
eq <- paste0("y = ", cf[1],
ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " * ln(x) from 0 to ",xmax)
mtext(eq,3,line=-2,col = "red")
output$summary <- renderPrint(summary(logEstimate))
output$calcCurve <-
renderPrint(round(cf[2]*log(input$calcFeet)+cf[1]))
})
The curve consistently "crosses twice" on the data; fitting too low at low/high points on the X axis, fitting too high at the middle of the X axis.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature, barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24 hour test would measure the air velocity at
each distance over 200 times, allowing changes in temperature as the
room warmed or cooled throughout the day to address any confounding
environmental factors by introducing randomization.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe the most likely
cause of this puffing is due to eddy currents and entrainment of the
surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate
that the initial data was accurate, we ran 10 monte carlo folding
simulations at 25% of the data set and ensured that the generated
medians were within a reasonable value of each other.
Validation: Fans were run every three months and the monte carlo
folding simulations were observed. If the error rate was <5% from our
previous test, we validated the previous test.
There is no problem with the code itself, you found the best possible fit using a logarithmic curve. I double-checked using Mathematica, and I obtain the same results.
The problem seems to reside in your model. From the data you provided and the description of the origin of the data, the logarithmic function might not the best model for your measurements. The description indicates that the velocity must be a finite value at x=0, and slowly tends towards 0 while going to infinity. However, the negative logarithmic function will be infinite at x=0 and negative after a while.
I am not a physicist, but my intuition would tend towards using the inverse-square law or using the exponential function. I tested both, and the exponential function gives way better results:

Periodogram (TSA In R) can't find correct frequency

I'm trying to process a sinusoidal time series data set:
I am using this code in R:
library(readxl)
library(stats)
library(matplot.lib)
library(TSA)
Data_frame<-read_excel("C:/Users/James/Documents/labssin2.xlsx")
# compute the Fourier Transform
p = periodogram(Data_frame$NormalisedVal)
dd = data.frame(freq=p$freq, spec=p$spec)
order = dd[order(-dd$spec),]
top2 = head(order, 5)
# display the 2 highest "power" frequencies
top2
time = 1/top2$f
time
However when examining the frequency spectrum the frequency (which is in Hz) is ridiculously low ~ 0.02Hz, whereas it should have one much larger frequency of around 1Hz and another smaller one of 0.02Hz (just visually assuming this is a sinusoid enveloped in another sinusoid).
Might be a rather trivial problem, but has anyone got any ideas as to what could be going wrong?
Thanks in advance.
Edit 1: Using
result <- abs(fft(df$Data_frame.NormalisedVal))
Produces what I am expecting to see.
Edit2: As requested, text file with the output to dput(Data_frame).
http://m.uploadedit.com/bbtc/1553266283956.txt
The periodogram function returns normalized frequencies in the [0,0.5] range, where 0.5 corresponds to the Nyquist frequency, i.e. half your sampling rate. Since you appear to have data sampled at 60Hz, the spike at 0.02 would correspond to a frequency of 0.02*60 = 1.2Hz, which is consistent with your expectation and in the neighborhood of what can be seen in the data your provided (the bulk of the spike being in the range of 0.7-1.1Hz).
On the other hand, the x-axis on the last graph you show based on the fft is an index and not a frequency. The corresponding frequency should be computed according to the following formula:
f <- (index-1)*fs/N
where fs is the sampling rate, and N is the number of samples used by the fft. So in your graph the same 1.2Hz would appear at an index of ~31 assuming N is approximately 1500.
Note: the sampling interval in the data you provided is not quite constant and may affect the results as both periodogram and fft assume a regular sampling interval.

Find period with lowest variability in time series r

I have a time series and would like to find the period that has the lowest contiguous variability, i.e. the period in which the rolling SD hovers around the minimum for the longest consecutive time steps.
test=c(10,12,14,16,13,13,14,15,15,14,16,16,16,16,16,16,16,15,14,15,12,11,10)
rol=rollapply(x, width=4, FUN=sd)
rol
I can easily see from the data or the graph that the longest period with the lowest variability start at t=11. Is there a function that can help me find this period of continued low variability, perhaps trying automatically different size for the rolling window? I am not interested in finding the time step with the lowest SD, but a period where this low SD is more consistent than others.
All I can think for now is looking at the difference between rol[i]-rol[i+1], looping through the vector and use a counter to find periods of consecutive low values of SD. I was also thinking of using cluster analysis, something like kmeans(rol, 5) but I can have long time series which are complex and I would have to manually pick the number of clusters.

Finding cyclic function from points with more than 2 indep. variales

I am doing my master thesis in Electrical engineering about the impact of the humidity and
temperature on power consumption
I have a problem that is related to statistics, numerical methods and mathematics topics
I have real data for one year (year 2000)
Every day has 24 hours records for temperature, humidity, power consumption
So, the total points for one parameter, for example, temperature is 24*366 = 8784 points
I classified the pattern of the power to three patterns:
Daily, seasonally and to cover the whole year
The aim is to find a mathematical model of the following form:
P = f ( T , H , t , date )
Where,
P = power consumption,
T = temperature,
t = time in hours from 1 to 24,
date = the date number in the year from 1 to 366 ( or date number in a month from 1 to 31)
I started drawing in Matlab program a sample day, 1st August showing the effect of time,
humidity and temperature on power consumption::
http://www7.0zz0.com/2010/12/11/23/264638558.jpg
Then, I make the analysis wider to see what changes happened when drawing this day with the next day:
http://www7.0zz0.com/2010/12/11/23/549837601.jpg
After that I make it wider and include the 1st week of august:
http://www7.0zz0.com/2010/12/11/23/447153078.jpg
Then, the whole month, august:
http://www7.0zz0.com/2010/12/12/00/120820248.jpg
Then, starting from January, I plot power and temperature for 1st six months without
humidity (only for scaling):
http://www7.0zz0.com/2010/12/12/00/908911392.jpg
with humidity :
http://www7.0zz0.com/2010/12/12/00/102651717.jpg
Then, the whole year plot without humidity:
( P,T,H have constant values but I separate H only for scaling since H values are too much higher than P and H and that cause shrinking of the plot making small plots for P and T)
http://www7.0zz0.com/2010/12/11/23/290259320.jpg
and finally with humidity:
http://www7.0zz0.com/2010/12/11/23/842530863.jpg
The reason I have plotted these figures is to follow the behaviors of all parameters. How P is changing with respect to Temperature, Humidity, and time in hours and time in day number.
It is clear that these figures represent cyclic behavior but this behavior is not
constant. It is starting to increase and then decrease during the whole year.
For example the behavior of 1st January is almost the same as any other day in the year
but the difference is in shifting up or down, left or right.
Also, Temperature and Humidity are almost sinusoidal. However, Power consumption behavior is not purely sinusoidal as seen in the following figure:
http://www7.0zz0.com/2010/12/12/00/153503144.jpg
I am not expert in statistics and numerical methods, and this matter now does not have relation with electrical engineering concept.
The results I am aiming to get are:
Specify the day number in the year from 1 to 366,
then specify the hour in that day,
temperature and humidity also will be specified.
All of these parameters are to be specified by the user
The result:
The mathematical model should be capable to find the power consumption in that specific hour of that day.
Then, the Power found from the model will be compared to the measured real power from the
data and if the values are very close to each other, then the model will be accurate and
accepted.
I am sorry for this long question. I actually read many papers, many helps but I could not
reach to the correct approach of how to find one unified model by following the curves
behavior from starting till the end of the year and also having more than one independent
variable has disturbed me a lot.
I hope this problem is not difficult for statistics and mathematics experts.
Any help will be highly appreciated,
Thanks in advance
Regards
About this:
"Also, Temperature and Humidity are almost sinusoidal. However, Power consumption behavior is not purely sinusoidal"
Seems in local scale (several days/weeks order) temperature and humidity can be expressed as periodic train of Gaussians:
After such assumption we can model power consumption as superposition of temperature and humidity trains of Gaussians. Consider this opencalc spreadsheet chart:
in which f1 and f2 are train of gaussians (here only 4 peaks, but you may calculate as many as you need for data fitting) and f3 is superposition of these two trains,-
just (f12 + f22)1/2
However i don't know to what degree power consumption follows the addition of train of gaussians. You may invest time to explore this possibility.
Good luck!

Resources