I am struggling to understand why EMA (Exponential Moving Average) is different in these 2 cases.
options(scipen = 10)
library(quantmod)
getSymbols("AAPL", src = "google")
data1 <- EMA(AAPL[, "AAPL.Close"])
data2 <- EMA(tail(AAPL[, "AAPL.Close"], n = 10))
result <- data.frame(tail(data1, n = 1), tail(data2, n = 1))
In the first EMA call i supply as parameter the whole AAPL sample. In the second EMA call i supply only the minimum amount of data to calculate EMA for the last date. If i compare calculated value for the last day, they are different.
In concrete, result[1,1] is 152.907 and result[1,2] is 152.623. Why is this happening? I would expect both numbers to be the same, since EMA is not cumulative.
That is because the exponential decay has 'infinite' history: it decays, but never completely goes away.
So the set of weights is different between your two use cases, and hence so is the result. There is a decent recent series on MAs for smoothing price series:
part one
part two
It should make it clear that what you were expecting only holds for SMA which does in fact go to zero weights.
Related
I'm trying to process a sinusoidal time series data set:
I am using this code in R:
library(readxl)
library(stats)
library(matplot.lib)
library(TSA)
Data_frame<-read_excel("C:/Users/James/Documents/labssin2.xlsx")
# compute the Fourier Transform
p = periodogram(Data_frame$NormalisedVal)
dd = data.frame(freq=p$freq, spec=p$spec)
order = dd[order(-dd$spec),]
top2 = head(order, 5)
# display the 2 highest "power" frequencies
top2
time = 1/top2$f
time
However when examining the frequency spectrum the frequency (which is in Hz) is ridiculously low ~ 0.02Hz, whereas it should have one much larger frequency of around 1Hz and another smaller one of 0.02Hz (just visually assuming this is a sinusoid enveloped in another sinusoid).
Might be a rather trivial problem, but has anyone got any ideas as to what could be going wrong?
Thanks in advance.
Edit 1: Using
result <- abs(fft(df$Data_frame.NormalisedVal))
Produces what I am expecting to see.
Edit2: As requested, text file with the output to dput(Data_frame).
http://m.uploadedit.com/bbtc/1553266283956.txt
The periodogram function returns normalized frequencies in the [0,0.5] range, where 0.5 corresponds to the Nyquist frequency, i.e. half your sampling rate. Since you appear to have data sampled at 60Hz, the spike at 0.02 would correspond to a frequency of 0.02*60 = 1.2Hz, which is consistent with your expectation and in the neighborhood of what can be seen in the data your provided (the bulk of the spike being in the range of 0.7-1.1Hz).
On the other hand, the x-axis on the last graph you show based on the fft is an index and not a frequency. The corresponding frequency should be computed according to the following formula:
f <- (index-1)*fs/N
where fs is the sampling rate, and N is the number of samples used by the fft. So in your graph the same 1.2Hz would appear at an index of ~31 assuming N is approximately 1500.
Note: the sampling interval in the data you provided is not quite constant and may affect the results as both periodogram and fft assume a regular sampling interval.
I am using the acf function in Time Series Analysis and have confusion understanding the lag.max argument in it.
The help for the function gives the following explanation for lag.max-
lag.max: maximum lag at which to calculate the acf. Default is
10*log10(N/m) where N is the number of observations and m the
number of series. Will be automatically limited to one less
than the number of observations in the series.
What's m or the number of series?
Say I have a time series having monthyl data for the past 34 months and I need to make a prediction for the next month (or the 35th month).
In this case N will be 34, but what should be m so that I can calculate "lag.max" parameter?
Thanks!
m is the dimension of your data. So it just matters if you have a multidimensional time series. In your case, as I understand from your question, m=1.
N<-200
a<-1:N
b<-1:N
acf(a)
# m=1
# lag.max = 10*log10(N/1) = 23"
df<-data.frame(a,b)
acf(df)
# m=2
# lag.max = 10*log10(N/2) = 20"
Following up from an R blog which is interesting and quite useful to simulate the time series of an unknown area using its Weibull parameters.
Although this method gives a reasonably good estimate of time series as a whole it suffers a great deal when we look for seasonal changes. To account for seasonal changes I want to employ seasonal maximum wind speeds and carry out the time series synthesis such that the yearly distribution remains constant ie. shape and scale parameters (annual values).
I want to employ seasonal maximum wind speeds to the below code by using 12 different maximum wind speeds, one each for every month. This will allow greater wind speeds at certain month and lower in others and should even out the resultant time series.
The code follows like this:
MeanSpeed<-7.29 ## Mean Yearly Wind Speed at the site.
Shape=2; ## Input Shape parameter (yearly).
Scale=8 ##Calculated Scale Parameter ( yearly).
MaxSpeed<-17 (##yearly)
## $$$ 12 values of these wind speed one for each month to be used. The resultant time series should satisfy shape and scale parameters $$ ###
nStates<-16
nRows<-nStates;
nColumns<-nStates;
LCateg<-MaxSpeed/nStates;
WindSpeed=seq(LCateg/2,MaxSpeed-LCateg/2,by=LCateg) ## Fine the velocity vector-centered on the average value of each category.
##Determine Weibull Probability Distribution.
wpdWind<-dweibull(WindSpeed,shape=Shape, scale=Scale); # Freqency distribution.
plot(wpdWind,type = "b", ylab= "frequency", xlab = "Wind Speed") ##Plot weibull probability distribution.
norm_wpdWind<-wpdWind/sum(wpdWind); ## Convert weibull/Gaussian distribution to normal distribution.
## Correlation between states (Matrix G)
g<-function(x){2^(-abs(x))} ## decreasing correlation function between states.
G<-matrix(nrow=nRows,ncol=nColumns)
G <- row(G)-col(G)
G <- g(G)
##--------------------------------------------------------
## iterative process to calculate the matrix P (initial probability)
P0<-diag(norm_wpdWind); ## Initial value of the MATRIX P.
P1<-norm_wpdWind; ## Initial value of the VECTOR p.
## This iterative calculation must be done until a certain error is exceeded
## Now, as something tentative, I set the number of iterations
steps=1000;
P=P0;
p=P1;
for (i in 1:steps){
r<-P%*%G%*%p;
r<-as.vector(r/sum(r)); ## The above result is in matrix form. I change it to vector
p=p+0.5*(P1-r)
P=diag(p)}
## $$ ----Markov Transition Matrix --- $$ ##
N=diag(1/as.vector(p%*%G));## normalization matrix
MTM=N%*%G%*%P ## Markov Transition Matrix
MTMcum<-t(apply(MTM,1,cumsum));## From the MTM generated the accumulated
##-------------------------------------------
## Calculating the series from the MTMcum
##Insert number of data sets.
LSerie<-52560; Wind Speed every 10 minutes for a year.
RandNum1<-runif(LSerie);## Random number to choose between states
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
StatesSeries=InitialState;
## Initallise----
## The next state is selected to the one in which the random number exceeds the accumulated probability value
##The next iterative procedure chooses the next state whose random number is greater than the cumulated probability defined by the MTM
for (i in 2:LSerie) {
## i has to start on 2 !!
State=min(which(RandNum1[i]<=MTMcum[State,]));
## if (is.infinite (State)) {State = 1}; ## when the above condition is not met max -Inf
StatesSeries=c(StatesSeries,State)}
RandNum2<-runif(LSerie); ## Random number to choose between speeds within a state
SpeedSeries=WindSpeed[StatesSeries]-0.5+RandNum2*LCateg;
##where the 0.5 correction is needed since the the WindSpeed vector is centered around the mean value of each category.
print(fitdistr(SpeedSeries, 'weibull')) ##MLE fitting of SpeedSeries
Can anyone suggest where and what changes I need to make to the code?
I don't know much about generating wind speed time series but maybe those guidelines can help you improve your code readability/reusability:
#1 You probably want to have a function which will generate a wind speed time
serie given a number of observations and a seasonal maximum wind speed. So first try to define your code inside a block like this one:
wind_time_serie <- function(nobs, max_speed){
#some code here
}
#2 Doing so, if it seems that some parts of your code are useful to generate wind speed time series but aren't about wind speed time series, try to put them into functions (e.g. the part you compute norm_wpdWind, the part you compute MTMcum,...).
#3 Then, the part of your code at the beginning when your define global variable should disappear and become default arguments in functions.
#4 Avoid using endline comments when your line is already long and delete the ending semicolumns.
#This
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
#Would become this:
#Assumes that the initial state is 1 (this must be changed when concatenating days)
State<-InitialState<-1
Then your code should be more reusable / readable by other people. You have an example below of those guidelines applied to the rnorm part:
norm_distrib<-function(maxSpeed, states = 16, shape = 2, scale = 8){
#Fine the velocity vector-centered on the average value of each category.
LCateg<-maxSpeed/states
WindSpeed=seq(LCateg/2,maxSpeed-LCateg/2,by=LCateg)
#Determine Weibull Probability Distribution.
wpdWind<-dweibull(WindSpeed,shape=shape, scale=scale)
#Convert weibull/Gaussian distribution to normal distribution.
return(wpdWind/sum(wpdWind))
}
#Plot normal distribution with the max speed you want (e.g. 17)
plot(norm_distrib(17),type = "b", ylab= "frequency", xlab = "Wind Speed")
I've been stuck working on this for a little while now and I find it very hard to believe that this isn't an inbuilt function or that someone hasn't dealt with this before now.
The function I want to run should compare two columns of a dataframe by until they "best correlation is found. The data I am using is from two scientific instruments and their sampling/averaging times differ, which is why I want to shift the data.
date associated with only one element will be adjusted.
if correlation of data + x seconds is > that current correlation
increase current
note increasing
else if correlation of data - x seconds ix > than current correlation
decrease current date/time
not decreasing
end if
while correlation of data + x seconds is > than current correlation
increase current date/time by x seconds
end while
while correlation of data - seconds is > than current correlation
decrease current date/time by x seconds
end while
If there is a function that will do this great if not I will provide additional info + code
This is what my current code structure is.
Date is POXISct 'GMT', Dusttrak is numeric, CO is numeric, color is a number I have created from time to give me a colored time series
I am currently using rcorr to find the correlation but date has been an issue, so I will either need to convert from date to numeric and back afterwards.
let's use synthetic data, as you have only pasted in an image:
set.seed(100)
x = rnorm(100)
y = rnorm(100)
now we use ccf:
z <- ccf(x,y, plot = F) #don't want plot
z is a list with our results, which we can do a little subsetting on to get our max lag:
bestval = which.max(z$acf)
z$lag[bestval] #our lag
16
For the time series, it gets a little harder - if you don't have uniform time steps in your rows, you might have to do some normalisation.
I am trying to generate a series of wait times for a Markov chain where the wait times are exponentially distributed numbers with rate equal to one. However, I don't know the number of transitions of the process, rather the total time spent in the process.
So, for example:
t <- rexp(100,1)
tt <- cumsum(c(0,t))
t is a vector of the successive and independent waiting times and tt is a vector of the actual transition time starting from 0.
Again, the problem is I don't know the length of t (i.e. the number of transitions), rather how much total waiting time will elapse (i.e. the floor of last entry in tt).
What is an efficient way to generate this in R?
The Wikipedia entry for Poisson process has everything you need. The number of arrivals in the interval has a Poisson distribution, and once you know how many arrivals there are, the arrival times are uniformly distributed within the interval. Say, for instance, your interval is of length 15.
N <- rpois(1, lambda = 15)
arrives <- sort(runif(N, max = 15))
waits <- c(arrives[1], diff(arrives))
Here, arrives corresponds to your tt and waits corresponds to your t (by the way, it's not a good idea to name a vector t since t is reserved for the transpose function). Of course, the last entry of waits has been truncated, but you mentioned only knowing the floor of the last entry of tt, anyway. If he's really needed you could replace him with an independent exponential (bigger than waits[N]), if you like.
If I got this right: you want to know how many transitions it'll take to fill your time interval. Since the transitions are random and unknown, there's no way to predict for a given sample. Here's how to find the answer:
tfoo<-rexp(100,1)
max(which(cumsum(tfoo)<=10))
[1] 10
tfoo<-rexp(100,1) # do another trial
max(which(cumsum(tfoo)<=10))
[1] 14
Now, if you expect to need to draw some huge sample, e.g. rexp(1e10,1), then maybe you should draw in 'chunks.' Draw 1e9 samples and see if sum(tfoo) exceeds your time threshold. If so, dig thru the cumsum . If not, draw another 1e9 samples, and so on.