I have a time series and would like to find the period that has the lowest contiguous variability, i.e. the period in which the rolling SD stays near its minimum for the largest number of consecutive time steps.
library(zoo)  # provides rollapply()
test=c(10,12,14,16,13,13,14,15,15,14,16,16,16,16,16,16,16,15,14,15,12,11,10)
rol=rollapply(test, width=4, FUN=sd)  # rolling SD over windows of 4 points
rol
I can easily see from the data or the graph that the longest period with the lowest variability starts at t=11. Is there a function that can help me find this period of continued low variability, perhaps by automatically trying different sizes for the rolling window? I am not interested in finding the time step with the lowest SD, but a period where this low SD is more consistent than in others.
All I can think of for now is looking at the difference rol[i]-rol[i+1], looping through the vector and using a counter to find periods of consecutive low values of SD. I was also thinking of using cluster analysis, something like kmeans(rol, 5), but I can have long time series which are complex, and I would have to pick the number of clusters manually.
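One rough way to automate the counter idea above is run-length encoding: flag the windows whose rolling SD is at or below some cutoff, then use rle() to find the longest flagged run. This is only a sketch. The 25% quantile cutoff and the names cutoff, low, starts and ends are my own assumptions, and the rolling SD is computed in base R so that the snippet runs without extra packages.

```r
test <- c(10,12,14,16,13,13,14,15,15,14,16,16,16,16,16,16,16,15,14,15,12,11,10)
width <- 4

# Rolling SD in base R: one value per window start position
n_win <- length(test) - width + 1
rol <- sapply(seq_len(n_win), function(i) sd(test[i:(i + width - 1)]))

# Assumed cutoff: windows in the lowest quarter of rolling SDs count as "low"
cutoff <- unname(quantile(rol, 0.25))
low <- rol <= cutoff

# Run-length encode the low/not-low flags and locate every run
runs <- rle(low)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Longest run of consecutive low-SD windows (first one if there are ties)
best <- which(runs$values & runs$lengths == max(runs$lengths[runs$values]))[1]
c(start = starts[best], end = ends[best], n_windows = runs$lengths[best])
```

On the example data this picks out the run of windows starting at t=11. The start and end are window start positions, so the last window still covers width - 1 further points. Trying several window sizes then amounts to wrapping the same steps in a loop over width.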
Related
I have two variables, x and y, measured at one minute intervals for over two years. The average daily values of x and y are almost 90% correlated. However, when I analyze x and y in one minute intervals they are only 50% correlated. How can I detect the time interval at which this correlation becomes 90%? Ideally I'd like to do this in R.
I'm new to statistics/econometrics, so my apologies if this question is very basic!
I'm not quite sure what you are asking here. What do you mean by x and y being 90 "percent" correlated? Do you mean you get a correlation coefficient of .9?
Beyond this clarification, you can absolutely have a situation where the averages of two variables are more correlated than the raw measurements. In other words, order matters: the correlation of the averages is not the average of the correlations. For example, the R code below takes 3 measurements each hour for 2 hours (6 measurements total); the overall correlation is about .54, while the correlation of the hourly averages is a perfect 1 (with only two averaged points it can only be 1 or -1, so treat this as a toy illustration). When you correlate averages, you remove the effect of how the values are ordered within each averaging interval, and that ordering turns out to matter a great deal for the correlation. Let me know if I missed something about your question, though.
X=c(1,2,3,4,5,6)
Y=c(3,2,1,6,5,4)
cor(X,Y)
HourAvgX=c(mean(X[1:3]),mean(X[4:6]))
HourAvgY=c(mean(Y[1:3]),mean(Y[4:6]))
cor(HourAvgX,HourAvgY)
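To look for the interval at which the correlation reaches a given level, one option is to average both series over blocks of increasing width and recompute the correlation each time. The sketch below uses simulated data (a shared hourly level plus independent minute-level noise); the helper agg_cor and the chosen widths are hypothetical, not something from the question.

```r
set.seed(1)
n <- 6000                                # simulated one-minute samples
shared <- rep(rnorm(n / 60), each = 60)  # level common to x and y, constant within each hour
x <- shared + rnorm(n)                   # independent minute-level noise on top
y <- shared + rnorm(n)

# Correlation of block means for a given block width (in samples)
agg_cor <- function(x, y, width) {
  g <- (seq_along(x) - 1) %/% width      # block index for every sample
  cor(tapply(x, g, mean), tapply(y, g, mean))
}

widths <- c(1, 5, 15, 60, 240)
cors <- sapply(widths, function(w) agg_cor(x, y, w))
data.frame(width = widths, correlation = round(cors, 3))
```

Reading down the resulting table shows roughly where the correlation crosses a target such as .9. With real data, x and y would replace the simulated vectors.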
I have a plot with time as a POSIXct object on the x-axis and a dependent variable "ODBA" on the y-axis. The experiment was a 600-second trial. How do I calculate the total time in seconds that ODBA was below a certain threshold (e.g. 0.25)?
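We can use sum on the gaps between consecutive timestamps, keeping only the gaps that start with ODBA below the threshold:
sum(as.numeric(diff(time), units = "secs")[head(ODBA, -1) < 0.25])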
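One way to turn threshold crossings into a duration is to add up the gaps between consecutive timestamps for which the reading at the start of the gap is below the threshold. Here is a small sketch with made-up data; treating each reading as holding until the next sample is my assumption, and the values of time and ODBA are invented for illustration.

```r
# Hypothetical 1 Hz samples; the real data would come from the experiment
time <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 0:9
ODBA <- c(0.1, 0.2, 0.3, 0.4, 0.2, 0.1, 0.1, 0.5, 0.2, 0.3)
threshold <- 0.25

# Length of each sampling interval in seconds
dt <- as.numeric(diff(time), units = "secs")

# Was ODBA below the threshold at the start of each interval?
below <- head(ODBA, -1) < threshold

# Total time (in seconds) spent below the threshold
total_below <- sum(dt[below])
total_below
```

Because the gaps come from diff(time), the same code also works when the sampling interval is not constant.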
I have some data sampled at regular intervals that looks sinusoidal, and I would like to determine the frequency of the wave. To that end I obtained R and loaded the TSA package, which contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data - I do see a sharp spike but the maximum 'spec' value has a frequency of 0.007407407, how does this relate to my frequency of 80Hz?
I note that there is variable foo$bandwidth with a value of 0.001069167 which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.
The periodogram is computed from the time series without knowledge of your actual sampling interval. This results in frequencies that are limited to the normalized [0,0.5] range (cycles per sample). To obtain a frequency in Hertz that takes the sampling interval into account, you simply need to multiply by the sampling rate. In your case, the spike at a normalized frequency of 0.007407407, with a sampling rate of 10,000Hz, corresponds to a frequency of ~74Hz.
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively negligible. The foo$bandwidth variable is expressed in the same normalized units, so it corresponds to 0.001069167*10000Hz ~ 10.7Hz. As far as I can tell, it is the equivalent bandwidth of the spectral estimate reported by R's spec functions, i.e. a measure of how finely the estimate can resolve frequencies rather than a property of your signal.
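The conversion from normalized frequency to Hertz can be sketched as follows. I am using stats::spec.pgram from base R as a stand-in for TSA::periodogram, with taper = 0 and detrend = FALSE as assumed settings, so the numbers may differ slightly from what TSA reports; the names fs, peak_norm, peak_hz and bin_hz are mine.

```r
fs <- 10000                       # sampling rate in Hz (interval of .0001 s)
x <- (1:260) / fs                 # sample times in seconds
Frequency <- 80                   # tone frequency in Hz
y <- sin(2 * pi * Frequency * x)

# Raw periodogram; frequencies are returned in cycles per sample (0 to 0.5)
sp <- spec.pgram(y, taper = 0, detrend = FALSE, plot = FALSE)

peak_norm <- sp$freq[which.max(sp$spec)]  # normalized frequency of the largest spike
peak_hz <- peak_norm * fs                 # the same peak expressed in Hz
bin_hz <- diff(sp$freq[1:2]) * fs         # spacing between frequency bins in Hz

c(peak_hz = peak_hz, bin_hz = bin_hz)
```

The bin spacing gives a feel for how coarse the estimate is: the reported peak can only fall on one of those bins, so a longer record is needed to pin the tone down more precisely.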
I'm trying to build a time series model based on a cumulative variable that never decreases.
I'm interested in knowing when the observable will reach a certain value (i.e., when it will intersect with the blue line in the image below).
The orange line is fixed to the last known data point and increases based on the average of the last 5 observables.
The red line is not fixed and represents a linear fit based on the last 5 observables. This seems problematic because in Time Period 108 in the graph, the predicted value is less than the observable in the previous time period, which will never physically happen.
The green line is not fixed and represents a linear fit based on all observables.
I'm wondering if someone can suggest an alternative/better approach to modelling this type of situation.
I agree with #Imo.
I would suggest the following:
You can estimate the linear increase per time period, using all your data, or an appropriate subset (last 5 observations). Then, predict the values for the out-of-sample period, using the observation in time period 107.
If, for example, your increase per time period is 20 (dx/dt), and your last known observation at time T has the value of 200 (x), then x would be 220 at time T + 1.
Hence, you would apply your solution from the green line, but shift it up so that it starts at your last observation.
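A minimal sketch of that suggestion, assuming a simple linear fit for the slope and made-up data: the vector x, the target value and the variable names are all hypothetical.

```r
# Hypothetical cumulative observations, one per time period
x <- c(100, 118, 141, 160, 182, 200)
tt <- seq_along(x)
target <- 300                      # value the series should eventually reach

# Estimated increase per time period from a linear fit on all data
slope <- coef(lm(x ~ tt))[["tt"]]
stopifnot(slope > 0)               # a cumulative series should not trend downward

# Anchor the forecast at the last observation instead of the fitted line
last_x <- tail(x, 1)
steps_needed <- ceiling((target - last_x) / slope)
forecast <- last_x + slope * seq_len(steps_needed)

steps_needed                       # periods after the last observation until the target is reached
forecast
```

Because the forecast starts from the last observed value and the slope is positive, the predicted values can never drop below the most recent observation, which avoids the problem seen with the red line.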
I am trying to generate a series of wait times for a Markov chain where the wait times are exponentially distributed numbers with rate equal to one. However, I don't know the number of transitions of the process, rather the total time spent in the process.
So, for example:
t <- rexp(100,1)
tt <- cumsum(c(0,t))
t is a vector of the successive and independent waiting times and tt is a vector of the actual transition time starting from 0.
Again, the problem is I don't know the length of t (i.e. the number of transitions), but rather how much total waiting time will elapse (i.e. the floor of the last entry in tt).
What is an efficient way to generate this in R?
The Wikipedia entry for Poisson process has everything you need. The number of arrivals in the interval has a Poisson distribution, and once you know how many arrivals there are, the arrival times are uniformly distributed within the interval. Say, for instance, your interval is of length 15.
N <- rpois(1, lambda = 15)
arrives <- sort(runif(N, max = 15))
waits <- c(arrives[1], diff(arrives))
Here, arrives corresponds to your tt and waits corresponds to your t (by the way, it's not a good idea to name a vector t, since t is already the name of R's transpose function). Of course, the last entry of waits has been truncated, but you mentioned only knowing the floor of the last entry of tt anyway. If it's really needed, you could replace it with an independent exponential (bigger than waits[N]), if you like.
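The same idea can be written so that it also behaves when no arrival falls in the interval (with N equal to 0, arrives[1] would be NA). This is a sketch only; total_time and rate are names I introduced, and diff(c(0, arrives)) is simply another way of computing the waits.

```r
set.seed(42)
total_time <- 15    # length of the observation interval
rate <- 1           # rate of the exponential waiting times

# Number of arrivals in the interval is Poisson with mean rate * total_time
N <- rpois(1, lambda = rate * total_time)

# Given N, the arrival times are uniform on the interval; sort them
arrives <- sort(runif(N, min = 0, max = total_time))

# Waiting times between arrivals; yields numeric(0) when N is 0
waits <- diff(c(0, arrives))

length(waits)
```

Since cumsum(waits) reproduces arrives, either vector can be used depending on whether transition times or waiting times are needed.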
If I got this right, you want to know how many transitions it'll take to fill your time interval. Since the transitions are random and unknown, there's no way to predict that number for a given sample. Here's how to find the answer:
tfoo<-rexp(100,1)
max(which(cumsum(tfoo)<=10))
[1] 10
tfoo<-rexp(100,1) # do another trial
max(which(cumsum(tfoo)<=10))
[1] 14
Now, if you expect to need to draw some huge sample, e.g. rexp(1e10,1), then maybe you should draw in 'chunks'. Draw 1e9 samples and see if sum(tfoo) exceeds your time threshold. If so, dig through the cumsum. If not, draw another 1e9 samples, and so on.
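The chunked approach could look roughly like this. The function draw_waits and its chunk argument are hypothetical names, and the chunk size here is deliberately small for illustration.

```r
# Keep drawing exponential waits in chunks until their total exceeds total_time,
# then keep only the waits whose cumulative sum still fits in the interval.
draw_waits <- function(total_time, rate = 1, chunk = 1000) {
  waits <- numeric(0)
  while (sum(waits) <= total_time) {
    waits <- c(waits, rexp(chunk, rate))
  }
  waits[cumsum(waits) <= total_time]
}

set.seed(1)
w <- draw_waits(10)
length(w)    # number of transitions completed within the interval
sum(w)       # time of the last transition
```

Growing the vector with c() is fine for a handful of chunks. For very long intervals it would be cheaper to track the running total per chunk rather than recomputing sum(waits) over everything each time.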