Periodogram (TSA in R) can't find correct frequency

I'm trying to process a sinusoidal time series data set:
I am using this code in R:
library(readxl)
library(TSA)
Data_frame <- read_excel("C:/Users/James/Documents/labssin2.xlsx")
# compute the periodogram
p <- periodogram(Data_frame$NormalisedVal)
dd <- data.frame(freq = p$freq, spec = p$spec)
dd_sorted <- dd[order(-dd$spec), ]
top2 <- head(dd_sorted, 2)
# display the 2 highest "power" frequencies
top2
time <- 1/top2$freq
time
However, when examining the frequency spectrum, the dominant frequency (which I assumed to be in Hz) is ridiculously low, at ~0.02Hz, whereas there should be one much larger frequency of around 1Hz and another smaller one of around 0.02Hz (judging visually, this looks like a sinusoid enveloped in another sinusoid).
Might be a rather trivial problem, but has anyone got any ideas as to what could be going wrong?
Thanks in advance.
Edit 1: Using
result <- abs(fft(df$Data_frame.NormalisedVal))
Produces what I am expecting to see.
Edit 2: As requested, a text file with the output of dput(Data_frame):
http://m.uploadedit.com/bbtc/1553266283956.txt

The periodogram function returns normalized frequencies in the [0,0.5] range, where 0.5 corresponds to the Nyquist frequency, i.e. half your sampling rate. Since you appear to have data sampled at 60Hz, the spike at 0.02 would correspond to a frequency of 0.02*60 = 1.2Hz, which is consistent with your expectation and in the neighborhood of what can be seen in the data you provided (the bulk of the spike being in the range of 0.7-1.1Hz).
On the other hand, the x-axis on the last graph you show based on the fft is an index and not a frequency. The corresponding frequency should be computed according to the following formula:
f <- (index-1)*fs/N
where fs is the sampling rate, and N is the number of samples used by the fft. So in your graph the same 1.2Hz would appear at an index of ~31 assuming N is approximately 1500.
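As a minimal sketch of that conversion, assuming the 60Hz sampling rate mentioned above and reusing the result vector from your Edit 1:
# Sketch: convert FFT bin indices to frequencies in Hz (fs = 60 is an assumption)
fs <- 60
N <- length(result)
freq_hz <- (seq_along(result) - 1) * fs / N  # f = (index-1)*fs/N
# strongest non-DC component in the first half of the spectrum
half <- 2:floor(N/2)
peak <- half[which.max(result[half])]
freq_hz[peak]  # should land near 1.2Hz for this data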
Note: the sampling interval in the data you provided is not quite constant and may affect the results as both periodogram and fft assume a regular sampling interval.
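If that irregularity turns out to matter, one common workaround is to interpolate onto a regular grid first. A minimal sketch, assuming a hypothetical Time column in your data:
# Sketch: resample irregularly spaced samples onto a regular grid
# (the Time column name is an assumption; adjust to your data)
t_reg <- seq(min(Data_frame$Time), max(Data_frame$Time), length.out = nrow(Data_frame))
y_reg <- approx(Data_frame$Time, Data_frame$NormalisedVal, xout = t_reg)$y
p_reg <- periodogram(y_reg)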

Related

Preferentially Sampling Based upon Value Size

So, this is something I think I'm complicating far too much, but it also has some of my colleagues stumped as well.
I've got a set of areas represented by polygons, and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely proportional to their area. Rescaling the values to between zero and one (using the (x-min(x))/(max(x)-min(x)) method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest areas are almost always the ones sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either, as that would introduce arbitrary bounds on the probability allocations.
Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?
# Data
n <- 500
df <- data.frame("ID" = 1:n, "AREA" = replicate(n, sum(rexp(n = 8, rate = 0.1))))
# Generate the sampling probability somehow based upon the AREA values,
# with smaller areas having higher sampling probability:
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size = 1, prob = prob_vector)
There is no one best solution for this question as a wide range of probability vectors is possible. You can add any kind of curvature and slope.
In this small script, I simulated an extremely right skewed distribution of areas (0-100 units) and you can define and directly visualize any probability vector you want.
area.dist = rgamma(1000,1,3)*40
area.dist[area.dist>100] = 100
hist(area.dist, main="Probability functions")
area = seq(0,100,0.1)
prob_vector1 = 1-(area-min(area))/(max(area)-min(area))          ## linear
prob_vector2 = .8-(.6*(area-min(area))/(max(area)-min(area)))    ## low slope
prob_vector3 = 1/(1+((area-min(area))/(max(area)-min(area))))^4  ## strong curve
prob_vector4 = .4/(.4+((area-min(area))/(max(area)-min(area))))  ## low curve
lines(area, prob_vector1*500, col="red")
lines(area, prob_vector2*500, col="green")
lines(area, prob_vector3*500, col="blue")
lines(area, prob_vector4*500, col="orange")
legend("topright", c("linear","low slope","strong curve","low curve"),
       col=c("red","green","blue","orange"), lwd=1)
The output is the histogram of simulated areas with the four candidate probability curves (scaled by 500 for display) overlaid.
The red line is your intuitive solution; the other curves are adjustments that weaken it. Just change the numbers in the probability function until you get one that fits your expectations.
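To tie this back to your reproducible code, a minimal sketch of building prob_vector from df$AREA, here with the "strong curve" variant (the other variants slot in the same way, and sample() normalizes the weights internally):
# Sketch: inverse-area sampling weights with a tunable curve
scaled <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))
prob_vector <- 1 / (1 + scaled)^4  # "strong curve"; lower the exponent to flatten
s <- sample(df$ID, size = 1, prob = prob_vector)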

How to eliminate zeros in simulated data from rnorm function

I have a large set of high-frequency wind data. I use this data in a model to calculate gas exchange between atmosphere and water, taking the average wind of a 10-day series of measurements to represent gas exchange at a given time. Since the wind is an average value from a 10-day series, I want to apply the error to the output by adding the error to the input:
#fictional time series, manually created by me.
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
I then create 100 values around the mean and sd of the wind vector:
df <- as.data.frame(mapply(rnorm,mean=mean(wind),sd=sd(wind),n=100))
The standard deviation generates negative values. If these are run in the gas exchange model, I get a disproportionately large error, simply because wind speed can't be negative and the model is not constructed to run with negative wind measurements. It has been suggested that I log-transform the raw data, run rnorm() with the logged values, and then transform back. But since there are several zeros in the data (0 = no wind) I can't simply log the values. Hence I use the log(x+c) method:
wind.log <- log(wind+1)
df.log <- as.data.frame(mapply(rnorm,
                               mean = mean(wind.log),
                               sd = sd(wind.log), n = 100))
However, I will need to convert values back to actual wind measurements before running them in the model.
This is where it gets problematic, since I will need to use exp(x)-c to convert values back and then I end up with negative values again.
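For concreteness, the back-transformation step that reintroduces the negatives is a sketch like:
# Back-transform from the log(x+1) scale; any simulated value below 0
# on the log scale maps back to a negative "wind speed"
df.back <- exp(df.log[[1]]) - 1
min(df.back)  # can be negative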
Is there a way to work around this without truncating the 0's and screwing up the generated distribution around the mean?
My only alternative otherwise is to calculate gas exchange directly at every given time point and generate a distribution from that; those values would never be negative or zero and can hence be log-transformed.
Suggestion: use a zero-inflated/altered model, where you generate some proportion of zero values and draw the rest from a log-normal distribution (to make sure you don't get negative values):
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
prop_nonzero <- mean(wind>0)
lmean <- mean(log(wind[wind>0]))
lsd <- sd(log(wind[wind>0]))
n <- 500
vals <- rbinom(n, size=1,prob=prop_nonzero)*rlnorm(n,meanlog=lmean,sdlog=lsd)
Alternatively you could use a Tweedie distribution (as suggested by @aosmith), or fit a censored model to estimate the distribution of wind values that get measured as zero (assuming that the wind speed is never exactly zero, just too small to measure).
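As a quick sanity check of the zero-altered draw above:
# Sketch: compare the simulation against the original wind vector
mean(vals == 0)  # proportion of zeros, should be near 1 - prop_nonzero
mean(wind == 0)  # observed proportion of zeros
mean(vals)       # compare with mean(wind)
min(vals)        # never negative by construction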

Understanding TSA::periodogram()

I have some data sampled at regular intervals that looks sinusoidal, and I would like to determine the frequency of the wave. To that end I obtained R and loaded the TSA package, which contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data. I do see a sharp spike, but the maximum 'spec' value has a frequency of 0.007407407. How does this relate to my frequency of 80Hz?
I note that there is a variable foo$bandwidth with a value of 0.001069167, which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.
The periodogram is computed from the time series without knowledge of your actual sampling interval. This results in frequencies limited to the normalized [0,0.5] range. To obtain a frequency in Hertz that takes the sampling interval into account, you simply need to multiply by the sampling rate. In your case, with the spike at a normalized frequency of 0.007407407 and a sampling rate of 10,000Hz, this corresponds to a frequency of ~74Hz.
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively negligible. The foo$bandwidth value of 0.001069167 is the resolution bandwidth of the estimate: the normalized frequency-bin spacing (1/270 here, after padding) scaled by sqrt(1/12). Multiplied by the 10,000Hz sampling rate it corresponds to ~10.7Hz, another reflection of the limited frequency resolution.
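Putting the conversion into code, a minimal sketch using the foo object from the question:
# Sketch: convert the periodogram's normalized peak frequency to Hz
fs <- 1/0.0001  # sampling rate implied by x <- .0001*1:260
peak_norm <- foo$freq[which.max(foo$spec)]  # normalized peak, ~0.0074
peak_norm * fs  # ~74Hz, within 80 +/- 38Hz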

Adding seasonal variations to wind speed time series

Following up on an R blog post, which is interesting and quite useful for simulating the wind speed time series of an unknown area from its Weibull parameters.
Although this method gives a reasonably good estimate of the time series as a whole, it suffers a great deal when we look for seasonal changes. To account for seasonal changes I want to employ seasonal maximum wind speeds and carry out the time series synthesis such that the yearly distribution remains constant, i.e. the shape and scale parameters (annual values) stay the same.
I want to feed the code below 12 different maximum wind speeds, one for each month. This will allow greater wind speeds in certain months and lower ones in others, and should even out the resultant time series.
The code follows like this:
MeanSpeed<-7.29 ## Mean Yearly Wind Speed at the site.
Shape=2; ## Input Shape parameter (yearly).
Scale=8 ##Calculated Scale Parameter ( yearly).
MaxSpeed<-17 ## yearly
## $$$ 12 values of these wind speed one for each month to be used. The resultant time series should satisfy shape and scale parameters $$ ###
nStates<-16
nRows<-nStates;
nColumns<-nStates;
LCateg<-MaxSpeed/nStates;
WindSpeed=seq(LCateg/2,MaxSpeed-LCateg/2,by=LCateg) ## Define the velocity vector, centered on the average value of each category.
##Determine Weibull Probability Distribution.
wpdWind<-dweibull(WindSpeed,shape=Shape, scale=Scale); # Frequency distribution.
plot(wpdWind,type = "b", ylab= "frequency", xlab = "Wind Speed") ##Plot Weibull probability distribution.
norm_wpdWind<-wpdWind/sum(wpdWind); ## Normalise the Weibull densities so they sum to 1.
## Correlation between states (Matrix G)
g<-function(x){2^(-abs(x))} ## decreasing correlation function between states.
G<-matrix(nrow=nRows,ncol=nColumns)
G <- row(G)-col(G)
G <- g(G)
##--------------------------------------------------------
## iterative process to calculate the matrix P (initial probability)
P0<-diag(norm_wpdWind); ## Initial value of the MATRIX P.
P1<-norm_wpdWind; ## Initial value of the VECTOR p.
## This iterative calculation should run until the error falls below a tolerance;
## for now, as something tentative, I set a fixed number of iterations
steps=1000;
P=P0;
p=P1;
for (i in 1:steps){
r<-P%*%G%*%p;
r<-as.vector(r/sum(r)); ## The above result is in matrix form. I change it to vector
p=p+0.5*(P1-r)
P=diag(p)}
## $$ ----Markov Transition Matrix --- $$ ##
N=diag(1/as.vector(p%*%G));## normalization matrix
MTM=N%*%G%*%P ## Markov Transition Matrix
MTMcum<-t(apply(MTM,1,cumsum));## From the MTM generated the accumulated
##-------------------------------------------
## Calculating the series from the MTMcum
##Insert number of data sets.
LSerie<-52560 ## Wind speed every 10 minutes for a year.
RandNum1<-runif(LSerie);## Random number to choose between states
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
StatesSeries=InitialState;
## Initallise----
## The next state is selected to the one in which the random number exceeds the accumulated probability value
##The next iterative procedure chooses the next state whose random number is greater than the cumulated probability defined by the MTM
for (i in 2:LSerie) {
## i has to start on 2 !!
State=min(which(RandNum1[i]<=MTMcum[State,]));
## if (is.infinite (State)) {State = 1}; ## when the above condition is not met max -Inf
StatesSeries=c(StatesSeries,State)}
RandNum2<-runif(LSerie); ## Random number to choose between speeds within a state
SpeedSeries=WindSpeed[StatesSeries]-0.5+RandNum2*LCateg;
##where the 0.5 correction is needed since the WindSpeed vector is centered on the mean value of each category.
library(MASS) ## fitdistr() comes from MASS
print(fitdistr(SpeedSeries, 'weibull')) ##MLE fitting of SpeedSeries
Can anyone suggest where and what changes I need to make to the code?
I don't know much about generating wind speed time series, but maybe these guidelines can help you improve your code's readability/reusability:
#1 You probably want a function which generates a wind speed time series given a number of observations and a seasonal maximum wind speed. So first try to define your code inside a block like this one:
wind_time_serie <- function(nobs, max_speed){
#some code here
}
#2 Doing so, if some parts of your code are needed along the way but aren't specific to wind speed time series (e.g. the part where you compute norm_wpdWind, the part where you compute MTMcum, ...), try to put them into functions of their own.
#3 Then, the part at the beginning of your code where you define global variables should disappear; those values become default arguments of the functions.
#4 Avoid endline comments when the line is already long, and delete the trailing semicolons.
#This
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
#Would become this:
#Assumes that the initial state is 1 (this must be changed when concatenating days)
State<-InitialState<-1
Then your code should be more reusable / readable by other people. You have an example below of those guidelines applied to the rnorm part:
norm_distrib<-function(maxSpeed, states = 16, shape = 2, scale = 8){
#Define the velocity vector, centered on the average value of each category.
LCateg<-maxSpeed/states
WindSpeed=seq(LCateg/2,maxSpeed-LCateg/2,by=LCateg)
#Determine Weibull Probability Distribution.
wpdWind<-dweibull(WindSpeed,shape=shape, scale=scale)
#Convert weibull/Gaussian distribution to normal distribution.
return(wpdWind/sum(wpdWind))
}
#Plot normal distribution with the max speed you want (e.g. 17)
plot(norm_distrib(17),type = "b", ylab= "frequency", xlab = "Wind Speed")
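From there, your 12 monthly maxima would just drive the same function. A minimal sketch, with entirely made-up monthly values, only to show the wiring:
#Sketch: hypothetical monthly maximum wind speeds (the values are invented)
monthly_max <- c(14, 15, 17, 16, 14, 12, 11, 12, 13, 15, 16, 17)
#One normalized distribution per month, reusing norm_distrib() from above
monthly_distribs <- lapply(monthly_max, norm_distrib)
#e.g. plot June's distribution
plot(monthly_distribs[[6]], type = "b", ylab = "frequency", xlab = "Wind Speed")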

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (parasitic worm), using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population of discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation for a given timestep (tau) and updates the population accordingly; the process repeats up to a given time point. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), which produces an error when a rate is negative, including when population numbers become negative.
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Inital Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
sim.out$run <- i
sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values at each time step, with the data stored in a data frame and appended to the output of previous runs. However, when population levels get very low, more events reducing a compartment can occur in a step than there are individuals in the compartment. One remedy is to make the time step very small, but this makes the simulation take far too long.
My question: is there a way to augment the code so that, as the data is calculated at each time step, any negative population values are converted to 0?
I have tried working on this problem, but only seem to be able to come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
sir.out$L[sir.out$L < 0] <- 0
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method will eventually produce negative numbers. You can try the "OTL" method, based on the paper "Efficient step size selection for the tau-leaping simulation method", in which there are a few more parameters you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
